Thread: Triggers on foreign tables
Hello. Since the last discussion about it (http://www.postgresql.org/message-id/CADyhKSUGP6oJb1pybTiMOP3s5fg_yOkgUTo-7RBcLTNVaJ57Pw@mail.gmail.com),I finally managed to implement row triggers for foreign tables. The attached patch does not contain regression tests or documentation yet, a revised patch will include those sometime during the week. I'll add it to the commitfest then. A simple test scenario using postgres_fdw is however included as an attachment. The attached patch implements triggers for foreign tables according to the design discussed in the link above. For statement-level triggers, nothing has changed since the last patch. Statement-level triggers are just allowed in the command checking. For row-level triggers, it works as follows: - during rewrite phase, every attribute of the foreign table is duplicated as a resjunk entry if a trigger is defined on the relation. These entries are then used to store the old values for a tuple. There is room for improvement here: we could check if any trigger is in fact a row-level trigger - during execution phase, the before triggers are fired exactly like triggers on regular tables, except that old tuples are not fetched using their ctid, but rebuilt using the previously-stored resjunk attributes. - for after triggers, the whole queuing mechanism is bypassed for foreign tables. This is IMO acceptable, since foreign tables cannot have constraints or constraints triggers, and thus have not need for deferrable execution. This design avoids the need for storing and retrieving/identifying remote tuples until the query or transaction end. - the duplicated resjunk attributes are identified by being: - marked as resjunk in the targetlist - not being system or whole-row attributes (varno > 0) There is still one small issue with the attached patch: modifications to the tuple performed by the foreign data wrapper (via the returned TupleTableSlot in ExecForeignUpdate and ExecForeignInsert hooks) are not visible to the AFTER trigger. This could be fixed by merging the planslot containing the resjunk columns with the returned slot before calling the trigger, but I'm not really sure how to safely perform that. Any advice ? Many thanks to Kohei Kaigai for taking the time to help with the design. -- Ronan Dunklau http://dalibo.com - http://dalibo.org
Attachment
Le mardi 7 janvier 2014 12:11:28 Ronan Dunklau a écrit : > > The attached patch does not contain regression tests or documentation yet, a > revised patch will include those sometime during the week. I'll add it to > the commitfest then. Please find attached a new patch which fixes some issues with WHEN conditions and provide a set of regression tests as well as the documentation. The previously mentioned issue regarding values modified by the FDW is however not fixed. -- Ronan Dunklau http://dalibo.com - http://dalibo.org
Attachment
On Tue, Jan 07, 2014 at 12:11:28PM +0100, Ronan Dunklau wrote: > Since the last discussion about it (http://www.postgresql.org/message-id/CADyhKSUGP6oJb1pybTiMOP3s5fg_yOkgUTo-7RBcLTNVaJ57Pw@mail.gmail.com),I > finally managed to implement row triggers for foreign tables. > For statement-level triggers, nothing has changed since the last patch. > Statement-level triggers are just allowed in the command checking. > > For row-level triggers, it works as follows: > > - during rewrite phase, every attribute of the foreign table is duplicated as > a resjunk entry if a trigger is defined on the relation. These entries are > then used to store the old values for a tuple. There is room for improvement > here: we could check if any trigger is in fact a row-level trigger The need here is awfully similar to a need of INSTEAD OF triggers on views. For them, we add a single "wholerow" resjunk TLE. Is there a good reason to do it differently for triggers on foreign tables? > - for after triggers, the whole queuing mechanism is bypassed for foreign > tables. This is IMO acceptable, since foreign tables cannot have constraints > or constraints triggers, and thus have not need for deferrable execution. This > design avoids the need for storing and retrieving/identifying remote tuples > until the query or transaction end. Whether an AFTER ROW trigger is deferred determines whether it runs at the end of the firing query or at the end of the firing query's transaction. In all cases, every BEFORE ROW trigger of a given query fires before any AFTER ROW trigger of the same query. SQL requires that. This proposal would give foreign table AFTER ROW triggers a novel firing time; let's not do that. I think the options going forward are either (a) design a way to queue foreign table AFTER ROW triggers such that we can get the old and/or new rows at the end of the query or (b) not support AFTER ROW triggers on foreign tables for the time being. It's not clear to me whether SQL/MED contemplates triggers on foreign tables. Its <drop basic column definition> General Rules do mention the possibility of a reference from a trigger column list. On the other hand, I see nothing overriding the fact that CREATE TRIGGER only targets base tables. Is this clearer to anyone else? (This is a minor point, mainly bearing on the Compatibility section of the CREATE TRIGGER documentation.) > - the duplicated resjunk attributes are identified by being: > - marked as resjunk in the targetlist > - not being system or whole-row attributes (varno > 0) > > There is still one small issue with the attached patch: modifications to the > tuple performed by the foreign data wrapper (via the returned TupleTableSlot > in ExecForeignUpdate and ExecForeignInsert hooks) are not visible to the AFTER > trigger. This could be fixed by merging the planslot containing the resjunk > columns with the returned slot before calling the trigger, but I'm not really > sure how to safely perform that. Any advice ? Currently, FDWs are permitted to skip returning columns not actually referenced by any RETURNING clause. I would change that part of the API contract to require returning all columns when an AFTER ROW trigger is involved. You can't get around doing that by merging old column values, because, among other reasons, an INSERT does not have those values at all. On Tue, Jan 07, 2014 at 05:31:10PM +0100, Ronan Dunklau wrote: > --- a/contrib/postgres_fdw/expected/postgres_fdw.out > +++ b/contrib/postgres_fdw/expected/postgres_fdw.out > +NOTICE: TG_NAME: trig_row_after > +NOTICE: TG_WHEN: AFTER > +NOTICE: TG_LEVEL: ROW > +NOTICE: TG_OP: DELETE > +NOTICE: TG_RELID::regclass: rem1 > +NOTICE: TG_RELNAME: rem1 > +NOTICE: TG_TABLE_NAME: rem1 > +NOTICE: TG_TABLE_SCHEMA: public > +NOTICE: TG_NARGS: 2 > +NOTICE: TG_ARGV: [23, skidoo] > +NOTICE: OLD: (2,bye) > +NOTICE: TG_NAME: trig_row_before > +NOTICE: TG_WHEN: BEFORE > +NOTICE: TG_LEVEL: ROW > +NOTICE: TG_OP: DELETE > +NOTICE: TG_RELID::regclass: rem1 > +NOTICE: TG_RELNAME: rem1 > +NOTICE: TG_TABLE_NAME: rem1 > +NOTICE: TG_TABLE_SCHEMA: public > +NOTICE: TG_NARGS: 2 > +NOTICE: TG_ARGV: [23, skidoo] > +NOTICE: OLD: (11,"bye remote") > +NOTICE: TG_NAME: trig_row_after > +NOTICE: TG_WHEN: AFTER > +NOTICE: TG_LEVEL: ROW > +NOTICE: TG_OP: DELETE > +NOTICE: TG_RELID::regclass: rem1 > +NOTICE: TG_RELNAME: rem1 > +NOTICE: TG_TABLE_NAME: rem1 > +NOTICE: TG_TABLE_SCHEMA: public > +NOTICE: TG_NARGS: 2 > +NOTICE: TG_ARGV: [23, skidoo] > +NOTICE: OLD: (11,"bye remote") > +insert into rem1 values(1,'insert'); Would you trim the verbosity a bit? Maybe merge several of the TG_ fields onto one line, and remove the low-importance ones. Perhaps issue one line like this in place of all the current TG_ lines: NOTICE: trig_row_after(23, skidoo) AFTER ROW DELETE ON rem1 > --- a/contrib/postgres_fdw/sql/postgres_fdw.sql > +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql > @@ -384,3 +384,188 @@ insert into loc1(f2) values('bye'); > insert into rem1(f2) values('bye remote'); > select * from loc1; > select * from rem1; > + > +-- ==================================================================+-- test local triggers > +-- ==================================================================+ > +-- Trigger functions "borrowed" from triggers regress test. > +CREATE FUNCTION trigger_func() RETURNS trigger LANGUAGE plpgsql AS $$ > +BEGIN > + RAISE NOTICE 'trigger_func(%) called: action = %, when = %, level = %', TG_ARGV[0], TG_OP, TG_WHEN, TG_LEVEL; > + RETURN NULL; > +END;$$; > + > +CREATE TRIGGER trig_stmt_before BEFORE DELETE OR INSERT OR UPDATE ON ft1 FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func(); > +CREATE TRIGGER trig_stmt_after AFTER DELETE OR INSERT OR UPDATE ON ft1 FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func(); No test actually fires these triggers. > --- a/doc/src/sgml/trigger.sgml > +++ b/doc/src/sgml/trigger.sgml > @@ -96,6 +96,7 @@ > after may only be defined at statement level, while triggers that fire > instead of an <command>INSERT</command>, <command>UPDATE</command>, > or <command>DELETE</command> may only be defined at row level. > + On foreign tables, triggers can not be defined at row level. This is obsolete. > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -3150,13 +3150,16 @@ ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd, > /* No command-specific prep needed */ > break; > case AT_EnableTrig: /* ENABLE TRIGGER variants */ > - case AT_EnableAlwaysTrig: > - case AT_EnableReplicaTrig: > case AT_EnableTrigAll: > case AT_EnableTrigUser: > case AT_DisableTrig: /* DISABLE TRIGGER variants */ > case AT_DisableTrigAll: > case AT_DisableTrigUser: > + ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE); > + pass = AT_PASS_MISC; > + break; > + case AT_EnableReplicaTrig: > + case AT_EnableAlwaysTrig: Why permit some variants, but not every variant, of ALTER TABLE t ENABLE TRIGGER <type> on foreign tables? > case AT_EnableRule: /* ENABLE/DISABLE RULE variants */ > case AT_EnableAlwaysRule: > case AT_EnableReplicaRule: > diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c > index 4eff184..da3c568 100644 > --- a/src/backend/commands/trigger.c > +++ b/src/backend/commands/trigger.c > @@ -1844,9 +1864,7 @@ ExecCallTriggerFunc(TriggerData *trigdata, > */ > InitFunctionCallInfoData(fcinfo, finfo, 0, > InvalidOid, (Node *) trigdata, NULL); > - > pgstat_init_function_usage(&fcinfo, &fcusage); > - Please don't change unrelated whitespace. > MyTriggerDepth++; > PG_TRY(); > { > @@ -1946,10 +1964,10 @@ ExecASInsertTriggers(EState *estate, ResultRelInfo *relinfo) > > TupleTableSlot * > ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo, > - TupleTableSlot *slot) > + TupleTableSlot *tupleslot) > { > TriggerDesc *trigdesc = relinfo->ri_TrigDesc; > - HeapTuple slottuple = ExecMaterializeSlot(slot); > + HeapTuple slottuple = ExecMaterializeSlot(tupleslot); > HeapTuple newtuple = slottuple; > HeapTuple oldtuple; > TriggerData LocTriggerData; > @@ -2003,9 +2021,9 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo, > if (newslot->tts_tupleDescriptor != tupdesc) > ExecSetSlotDescriptor(newslot, tupdesc); > ExecStoreTuple(newtuple, newslot, InvalidBuffer, false); > - slot = newslot; > + tupleslot = newslot; > } > - return slot; > + return tupleslot; Keep the old variable name. Exceptions can be made if the name was deceptive, but "tupleslot" communicates the same thing as "slot". > @@ -2146,7 +2183,8 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo) > bool > ExecBRDeleteTriggers(EState *estate, EPQState *epqstate, > ResultRelInfo *relinfo, > - ItemPointer tupleid) > + ItemPointer tupleid, > + TupleTableSlot *tupleslot) The new argument is unused. > + if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE){ > + TriggerData LocTriggerData; > + trigtuple = ExtractOldTuple(tupleslot, relinfo); > + LocTriggerData.tg_trigtuple = trigtuple; > + LocTriggerData.tg_newtuple = NULL; > + LocTriggerData.tg_trigtuplebuf = InvalidBuffer; > + LocTriggerData.tg_newtuplebuf = InvalidBuffer; > + ExecARForeignTrigger(estate, > + LocTriggerData, > + relinfo, > + TRIGGER_EVENT_DELETE, > + TRIGGER_TYPE_DELETE); > + } > + else > + { > + trigtuple = GetTupleForTrigger(estate, Please use pgindent to fix the formatting of your new code. It's fine to introduce occasional whitespace errors, but they're unusually-plentiful here. > @@ -2604,7 +2704,11 @@ GetTupleForTrigger(EState *estate, > HeapTupleData tuple; > HeapTuple result; > Buffer buffer; > - > + /* > + * Foreign tables tuples are NOT stored on the heap. > + * Instead, old attributes values are stored as resjunk attributes in the > + * original tuple. So, we build a new tuple from those attributes here. > + */ Obsolete comment. That's done elsewhere, not here. > @@ -2731,6 +2835,90 @@ ltrmark:; > } > > /* > + * Get an old tuple from a "mixed tuple", containing both the new values as well > + * as well as the old ones as resjunk columns. > + */ > +static HeapTuple > +ExtractOldTuple(TupleTableSlot *planSlot, > + ResultRelInfo *relinfo) > +{ > + Relation relation = relinfo->ri_RelationDesc; > + TupleDesc base_tuple_desc = relation->rd_att; > + Datum * datum = palloc0(sizeof(Datum) * base_tuple_desc->natts); > + bool * isnull = palloc0(sizeof(bool) * base_tuple_desc->natts); > + HeapTuple tuple; > + /* > + * NewSlot is not null: this indicates a BEFORE trigger. > + */ > + JunkFilter * junk_filter = relinfo->ri_junkFilter; > + List * target_list = junk_filter->jf_targetList; > + ListCell * lc; > + foreach(lc, target_list) > + { > + TargetEntry *tle = (TargetEntry *) lfirst(lc); > + /* > + * Only consider previously added resjunk entries > + */ > + if(tle->resjunk) > + { > + if(IsA(tle->expr, Var)) > + { > + Var * junkvar = (Var * )tle->expr; > + /* > + * Do not copy system identifiers. > + */ > + if(junkvar->varoattno > 0) > + { > + datum[junkvar->varoattno - 1] = slot_getattr(planSlot, > + tle->resno, > + &(isnull[junkvar->varoattno - 1])); > + } For future reference, you mustn't just assume that a resjunk Var is the same resjunk Var you added for this purpose. The target list has many consumers, present and future, so you need to find your resjunk entries more-reliably than this. See other resjunk-adding code for examples. This concern goes away if you borrow the "wholerow" approach from INSTEAD OF triggers. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Thank you very much for this review. > The need here is awfully similar to a need of INSTEAD OF triggers on views. > For them, we add a single "wholerow" resjunk TLE. Is there a good reason to > do it differently for triggers on foreign tables? I wasn't aware of that, it makes for some much cleaner code IMO. Thanks. > > - for after triggers, the whole queuing mechanism is bypassed for foreign > > tables. This is IMO acceptable, since foreign tables cannot have > > constraints or constraints triggers, and thus have not need for > > deferrable execution. This design avoids the need for storing and > > retrieving/identifying remote tuples until the query or transaction end. > > Whether an AFTER ROW trigger is deferred determines whether it runs at the > end of the firing query or at the end of the firing query's transaction. > In all cases, every BEFORE ROW trigger of a given query fires before any > AFTER ROW trigger of the same query. SQL requires that. This proposal > would give foreign table AFTER ROW triggers a novel firing time; let's not > do that. > > I think the options going forward are either (a) design a way to queue > foreign table AFTER ROW triggers such that we can get the old and/or new > rows at the end of the query or (b) not support AFTER ROW triggers on > foreign tables for the time being. > I did not know this was mandated by the standard. The attached patch tries to solve this problem by allocating a tuplestore in the global afterTriggers structure. This tuplestore is used for the whole transaction, and uses only work_mem per transaction. Both old and new tuples are stored in this tuplestore. Some additional bookkeeping is done on the afterTriggers global structure, to keep track of the number of inserted tuples, and the current read pointer position. The tuples are identified by their order of insertion during the transaction. I think this could benefit from some support in the tuplestore API, by allowing arbitrary seek without the need to store more ReadPointers. I initially tried to keep track of them by allocating read pointers on the tuple store, but it turned out to be so expensive that I had to find another way (24bytes per stored tuple, which are not reclaimable until the end of the transaction). What do you think about this approach ? Is there something I missed which would make it not sustainable ? If you prefer, I also have a patch implementing the rest of the changes and keeping the previous behaviour for after triggers. > It's not clear to me whether SQL/MED contemplates triggers on foreign > tables. Its <drop basic column definition> General Rules do mention the > possibility of a reference from a trigger column list. On the other hand, > I see nothing overriding the fact that CREATE TRIGGER only targets base > tables. Is this clearer to anyone else? (This is a minor point, mainly > bearing on the Compatibility section of the CREATE TRIGGER documentation.) I do not have access to the standard specification, any advice regarding specs compliance would be welcomed. > > > - the duplicated resjunk attributes are identified by being: > > - marked as resjunk in the targetlist > > - not being system or whole-row attributes (varno > 0) > > > > There is still one small issue with the attached patch: modifications to > > the tuple performed by the foreign data wrapper (via the returned > > TupleTableSlot in ExecForeignUpdate and ExecForeignInsert hooks) are not > > visible to the AFTER trigger. This could be fixed by merging the planslot > > containing the resjunk columns with the returned slot before calling the > > trigger, but I'm not really sure how to safely perform that. Any advice ? > > Currently, FDWs are permitted to skip returning columns not actually > referenced by any RETURNING clause. I would change that part of the API > contract to require returning all columns when an AFTER ROW trigger is > involved. You can't get around doing that by merging old column values, > because, among other reasons, an INSERT does not have those values at all. I'm not sure this should be part of the API contract: it would make implementing a FDW more complicated than it is now. The attached patch hooks on rewriteTargetListIU to add the missing targets to the returning clause, when needed. This also changes the way the query's hasReturning flag is set to exclude the case when only resjunk entries are present in the returning list. > > +NOTICE: TG_NARGS: 2 > > +NOTICE: TG_ARGV: [23, skidoo] > > +NOTICE: OLD: (11,"bye remote") > > +insert into rem1 values(1,'insert'); > > Would you trim the verbosity a bit? Maybe merge several of the TG_ fields > onto one line, and remove the low-importance ones. Perhaps issue one line > like this in place of all the current TG_ lines: > > NOTICE: trig_row_after(23, skidoo) AFTER ROW DELETE ON rem1 > Fixed in the attached patch. > > +CREATE TRIGGER trig_stmt_before BEFORE DELETE OR INSERT OR UPDATE ON ft1 > > FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func(); +CREATE TRIGGER > > trig_stmt_after AFTER DELETE OR INSERT OR UPDATE ON ft1 FOR EACH > > STATEMENT EXECUTE PROCEDURE trigger_func(); > No test actually fires these triggers. These should have been placed on the rem1 foreign table, not ft1. Fixed. > > + On foreign tables, triggers can not be defined at row level. > > This is obsolete. I missed that from the earlier version of the patch, thank you. > Why permit some variants, but not every variant, of ALTER TABLE t ENABLE > TRIGGER <type> on foreign tables? I overlooked that. Fixed. > Keep the old variable name. Exceptions can be made if the name was > deceptive, but "tupleslot" communicates the same thing as "slot". Fixed. > > > @@ -2146,7 +2183,8 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo > > *relinfo)> > > bool > > ExecBRDeleteTriggers(EState *estate, EPQState *epqstate, > > > > ResultRelInfo *relinfo, > > > > - ItemPointer tupleid) > > + ItemPointer tupleid, > > + TupleTableSlot *tupleslot) > > The new argument is unused. I added it here for coherence. Now removed in the attached patch. > > Please don't change unrelated whitespace. > Please use pgindent to fix the formatting of your new code. It's fine to > introduce occasional whitespace errors, but they're unusually-plentiful > here. I think its done now. One problem I have with running pgindent is that I accidentally add chunks that were modified only by pgindent. > Obsolete comment. That's done elsewhere, not here. Ok > > For future reference, you mustn't just assume that a resjunk Var is the same > resjunk Var you added for this purpose. The target list has many > consumers, present and future, so you need to find your resjunk entries > more-reliably than this. See other resjunk-adding code for examples. This > concern goes away if you borrow the "wholerow" approach from INSTEAD OF > triggers. Using the wholerow approach, the entry is identified by the junkfilter jf_junkAttNo attribute. So this concern indeed goes away. Again, thank you for this review.
Attachment
Hello, > I initially tried to keep track of them by allocating read pointers on the > tuple store, but it turned out to be so expensive that I had to find another > way (24bytes per stored tuple, which are not reclaimable until the end of > the transaction). > > What do you think about this approach ? Is there something I missed which > would make it not sustainable ? > It seems to me reasonable approach to track them. Just a corner case, it may cause an unexpected problem if someone tried to update a foreign table with 2^31 of tuples because of int index. It may make sense to put a check fdw_nextwrite is less than INT_MAX. :-) I have a few other minor comments: +static HeapTuple +ExtractOldTuple(TupleTableSlot *planSlot, + ResultRelInfo *relinfo) +{ + bool isNull; + JunkFilter *junkfilter = relinfo->ri_junkFilter; + HeapTuple oldtuple = palloc0(sizeof(HeapTupleData)); + HeapTupleHeader td; + Datum datum = ExecGetJunkAttribute(planSlot, + junkfilter->jf_junkAttNo, + &isNull); + + /* shouldn't ever get a null result... */ + if (isNull) + elog(ERROR, "wholerow is NULL"); + td = DatumGetHeapTupleHeader(datum); + (*oldtuple).t_len = HeapTupleHeaderGetDatumLength(td); + (*oldtuple).t_data = td; + return oldtuple; +} + Why not usual coding manner as: oldtuple->t_len = HeapTupleHeaderGetDatumLength(td); oldtuple->t_data = td; Also, it don't put tableOid on the tuple. oldtuple->t_tableOid = RelationGetRelid(relinfo->ri_RelationDesc); @@ -730,6 +738,45 @@ rewriteTargetListIU(Query *parsetree, Relation target_relation, + /* + * For foreign tables, build a similar array for returning tlist. + */ + if (need_full_returning_tlist) + { + returning_tles = (TargetEntry **) palloc0(numattrs * sizeof(TargetEntry *)); + foreach(temp, parsetree->returningList) + { + TargetEntry *old_rtle = (TargetEntry *) lfirst(temp); + + if (IsA(old_rtle->expr, Var)) + { + Var *var = (Var *) old_rtle->expr; + + if (var->varno == parsetree->resultRelation) + { + attrno = var->varattno; + if (attrno < 1 || attrno > numattrs) + elog(ERROR, "bogus resno %d in targetlist", attrno); This checks caused an error when returning list contains a reference to system column; that has negative attribute number. Probably, it should be "continue;", instead of elog(). BTW, isn't it sufficient to inhibit optimization by putting whole-row-reference here, rather than whole-row-reference. Postgres_fdw extracts whole-row-reference into individual columns reference. + att_tup = target_relation->rd_att->attrs[attrno - 1]; + /* We can (and must) ignore deleted attributes */ + if (att_tup->attisdropped) + continue; + returning_tles[attrno - 1] = old_rtle; + } + } + } + } + Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com> > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ronan Dunklau > Sent: Thursday, January 23, 2014 11:18 PM > To: Noah Misch > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Triggers on foreign tables > > Thank you very much for this review. > > > The need here is awfully similar to a need of INSTEAD OF triggers on views. > > For them, we add a single "wholerow" resjunk TLE. Is there a good > > reason to do it differently for triggers on foreign tables? > > I wasn't aware of that, it makes for some much cleaner code IMO. Thanks. > > > > - for after triggers, the whole queuing mechanism is bypassed for > > > foreign tables. This is IMO acceptable, since foreign tables cannot > > > have constraints or constraints triggers, and thus have not need for > > > deferrable execution. This design avoids the need for storing and > > > retrieving/identifying remote tuples until the query or transaction > end. > > > > Whether an AFTER ROW trigger is deferred determines whether it runs at > > the end of the firing query or at the end of the firing query's transaction. > > In all cases, every BEFORE ROW trigger of a given query fires before > > any AFTER ROW trigger of the same query. SQL requires that. This > > proposal would give foreign table AFTER ROW triggers a novel firing > > time; let's not do that. > > > > I think the options going forward are either (a) design a way to queue > > foreign table AFTER ROW triggers such that we can get the old and/or > > new rows at the end of the query or (b) not support AFTER ROW triggers > > on foreign tables for the time being. > > > > I did not know this was mandated by the standard. > > The attached patch tries to solve this problem by allocating a tuplestore > in the global afterTriggers structure. This tuplestore is used for the whole > transaction, and uses only work_mem per transaction. > > Both old and new tuples are stored in this tuplestore. Some additional > bookkeeping is done on the afterTriggers global structure, to keep track > of the number of inserted tuples, and the current read pointer position. > The tuples are identified by their order of insertion during the > transaction. > I think this could benefit from some support in the tuplestore API, by > allowing arbitrary seek without the need to store more ReadPointers. > > I initially tried to keep track of them by allocating read pointers on the > tuple store, but it turned out to be so expensive that I had to find another > way (24bytes per stored tuple, which are not reclaimable until the end of > the transaction). > > What do you think about this approach ? Is there something I missed which > would make it not sustainable ? > > If you prefer, I also have a patch implementing the rest of the changes > and keeping the previous behaviour for after triggers. > > > It's not clear to me whether SQL/MED contemplates triggers on foreign > > tables. Its <drop basic column definition> General Rules do mention > > the possibility of a reference from a trigger column list. On the > > other hand, I see nothing overriding the fact that CREATE TRIGGER only > > targets base tables. Is this clearer to anyone else? (This is a > > minor point, mainly bearing on the Compatibility section of the CREATE > > TRIGGER documentation.) > > I do not have access to the standard specification, any advice regarding > specs compliance would be welcomed. > > > > > > - the duplicated resjunk attributes are identified by being: > > > - marked as resjunk in the targetlist > > > - not being system or whole-row attributes (varno > 0) > > > > > > There is still one small issue with the attached patch: > > > modifications to the tuple performed by the foreign data wrapper > > > (via the returned TupleTableSlot in ExecForeignUpdate and > > > ExecForeignInsert hooks) are not visible to the AFTER trigger. This > > > could be fixed by merging the planslot containing the resjunk > > > columns with the returned slot before calling the trigger, but I'm not > really sure how to safely perform that. Any advice ? > > > > Currently, FDWs are permitted to skip returning columns not actually > > referenced by any RETURNING clause. I would change that part of the > > API contract to require returning all columns when an AFTER ROW > > trigger is involved. You can't get around doing that by merging old > > column values, because, among other reasons, an INSERT does not have those > values at all. > > I'm not sure this should be part of the API contract: it would make > implementing a FDW more complicated than it is now. The attached patch hooks > on rewriteTargetListIU to add the missing targets to the returning clause, > when needed. > > This also changes the way the query's hasReturning flag is set to exclude > the case when only resjunk entries are present in the returning list. > > > > +NOTICE: TG_NARGS: 2 > > > +NOTICE: TG_ARGV: [23, skidoo] > > > +NOTICE: OLD: (11,"bye remote") > > > +insert into rem1 values(1,'insert'); > > > > Would you trim the verbosity a bit? Maybe merge several of the TG_ > > fields onto one line, and remove the low-importance ones. Perhaps > > issue one line like this in place of all the current TG_ lines: > > > > NOTICE: trig_row_after(23, skidoo) AFTER ROW DELETE ON rem1 > > > > Fixed in the attached patch. > > > > > +CREATE TRIGGER trig_stmt_before BEFORE DELETE OR INSERT OR UPDATE > > > +ON ft1 > > > FOR EACH STATEMENT EXECUTE PROCEDURE trigger_func(); +CREATE TRIGGER > > > trig_stmt_after AFTER DELETE OR INSERT OR UPDATE ON ft1 FOR EACH > > > STATEMENT EXECUTE PROCEDURE trigger_func(); > > No test actually fires these triggers. > > These should have been placed on the rem1 foreign table, not ft1. Fixed. > > > > + On foreign tables, triggers can not be defined at row level. > > > > This is obsolete. > > I missed that from the earlier version of the patch, thank you. > > > > Why permit some variants, but not every variant, of ALTER TABLE t > > ENABLE TRIGGER <type> on foreign tables? > > I overlooked that. Fixed. > > > > Keep the old variable name. Exceptions can be made if the name was > > deceptive, but "tupleslot" communicates the same thing as "slot". > > Fixed. > > > > > > @@ -2146,7 +2183,8 @@ ExecASDeleteTriggers(EState *estate, > > > ResultRelInfo *relinfo)> bool ExecBRDeleteTriggers(EState *estate, > > > EPQState *epqstate, > > > > > > ResultRelInfo *relinfo, > > > > > > - ItemPointer tupleid) > > > + ItemPointer tupleid, > > > + TupleTableSlot *tupleslot) > > > > The new argument is unused. > > I added it here for coherence. Now removed in the attached patch. > > > > > Please don't change unrelated whitespace. > > > Please use pgindent to fix the formatting of your new code. It's fine > > to introduce occasional whitespace errors, but they're > > unusually-plentiful here. > > I think its done now. One problem I have with running pgindent is that I > accidentally add chunks that were modified only by pgindent. > > > Obsolete comment. That's done elsewhere, not here. > > Ok > > > > > For future reference, you mustn't just assume that a resjunk Var is > > the same resjunk Var you added for this purpose. The target list has > > many consumers, present and future, so you need to find your resjunk > > entries more-reliably than this. See other resjunk-adding code for > > examples. This concern goes away if you borrow the "wholerow" > > approach from INSTEAD OF triggers. > > Using the wholerow approach, the entry is identified by the junkfilter > jf_junkAttNo attribute. So this concern indeed goes away. > > > Again, thank you for this review.
Thank you for taking the time to review this. Please find attached a new version of the patch. Le mercredi 29 janvier 2014 09:13:36 Kouhei Kaigai a écrit : > It may make sense to put a check fdw_nextwrite is less than INT_MAX. :-) The attached patch checks this, and add documentation for this limitation. I'm not really sure about how to phrase that correctly in the error message and the documentation. One can store at most INT_MAX foreign tuples, which means that at most INT_MAX insert or delete or "half-updates" can occur. By half-updates, I mean that for update two tuples are stored whereas in contrast to only one for insert and delete trigger. Besides, I don't know where this disclaimer should be in the documentation. Any advice here ? > Why not usual coding manner as: > oldtuple->t_len = HeapTupleHeaderGetDatumLength(td); > oldtuple->t_data = td; > > Also, it don't put tableOid on the tuple. > oldtuple->t_tableOid = RelationGetRelid(relinfo->ri_RelationDesc); Fixed, thank you. > > @@ -730,6 +738,45 @@ rewriteTargetListIU(Query *parsetree, Relation > target_relation, + /* > + * For foreign tables, build a similar array for returning tlist. > + */ > + if (need_full_returning_tlist) > + { > + returning_tles = (TargetEntry **) palloc0(numattrs * > sizeof(TargetEntry *)); + foreach(temp, parsetree->returningList) > + { > + TargetEntry *old_rtle = (TargetEntry *) lfirst(temp); > + > + if (IsA(old_rtle->expr, Var)) > + { > + Var *var = (Var *) old_rtle->expr; > + > + if (var->varno == parsetree->resultRelation) > + { > + attrno = var->varattno; > + if (attrno < 1 || attrno > numattrs) > + elog(ERROR, "bogus resno %d in targetlist", attrno); > > This checks caused an error when returning list contains a reference to > system column; that has negative attribute number. > Probably, it should be "continue;", instead of elog(). Are system attributes supposed to be accessible for foreign tables? I think they only make sense for postgres_fdw. Anyway, I fixed this and added a test case returning ctid, xmin and xmax. > > BTW, isn't it sufficient to inhibit optimization by putting > whole-row-reference here, rather than whole-row-reference. Postgres_fdw > extracts whole-row-reference into individual columns reference. The code is more straightforward with a whole-row reference. Done in the attached patch. -- Ronan Dunklau http://dalibo.com - http://dalibo.org
Attachment
On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote: > > > - for after triggers, the whole queuing mechanism is bypassed for foreign > > > tables. This is IMO acceptable, since foreign tables cannot have > > > constraints or constraints triggers, and thus have not need for > > > deferrable execution. This design avoids the need for storing and > > > retrieving/identifying remote tuples until the query or transaction end. > > > > Whether an AFTER ROW trigger is deferred determines whether it runs at the > > end of the firing query or at the end of the firing query's transaction. > > In all cases, every BEFORE ROW trigger of a given query fires before any > > AFTER ROW trigger of the same query. SQL requires that. This proposal > > would give foreign table AFTER ROW triggers a novel firing time; let's not > > do that. > > > > I think the options going forward are either (a) design a way to queue > > foreign table AFTER ROW triggers such that we can get the old and/or new > > rows at the end of the query or (b) not support AFTER ROW triggers on > > foreign tables for the time being. > > > > I did not know this was mandated by the standard. > > The attached patch tries to solve this problem by allocating a tuplestore > in the global afterTriggers structure. This tuplestore is used for the whole > transaction, and uses only work_mem per transaction. > > Both old and new tuples are stored in this tuplestore. Some additional > bookkeeping is done on the afterTriggers global structure, to keep track of > the number of inserted tuples, and the current read pointer position. The > tuples are identified by their order of insertion during the transaction. > I think this could benefit from some support in the tuplestore API, by > allowing arbitrary seek without the need to store more ReadPointers. > > I initially tried to keep track of them by allocating read pointers on the > tuple store, but it turned out to be so expensive that I had to find another > way (24bytes per stored tuple, which are not reclaimable until the end of the > transaction). > > What do you think about this approach ? Is there something I missed which > would make it not sustainable ? Seems basically reasonable. I foresee multiple advantages from having one tuplestore per query level as opposed to one for the entire transaction. You would remove the performance trap of backing up the tuplestore by rescanning. It permits reclaiming memory and disk space in AfterTriggerEndQuery() rather than at end of transaction. You could remove ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next one or the next two tuples from the tuplestore. Using work_mem per AfterTriggerBeginQuery() instead of per transaction is no problem. What do you think of that design change? If you do pursue that change, make sure the code still does the right thing when it drops queued entries during subxact abort. > I do not have access to the standard specification, any advice regarding > specs compliance would be welcomed. http://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_SQL_standards.3F > > > There is still one small issue with the attached patch: modifications to > > > the tuple performed by the foreign data wrapper (via the returned > > > TupleTableSlot in ExecForeignUpdate and ExecForeignInsert hooks) are not > > > visible to the AFTER trigger. This could be fixed by merging the planslot > > > containing the resjunk columns with the returned slot before calling the > > > trigger, but I'm not really sure how to safely perform that. Any advice ? > > > > Currently, FDWs are permitted to skip returning columns not actually > > referenced by any RETURNING clause. I would change that part of the API > > contract to require returning all columns when an AFTER ROW trigger is > > involved. You can't get around doing that by merging old column values, > > because, among other reasons, an INSERT does not have those values at all. > > I'm not sure this should be part of the API contract: it would make > implementing a FDW more complicated than it is now. The attached patch hooks > on rewriteTargetListIU to add the missing targets to the returning clause, > when needed. You're effectively faking the presence of a RETURNING list so today's conforming FDWs will already do the right thing? Clever. > > Please don't change unrelated whitespace. > > > Please use pgindent to fix the formatting of your new code. It's fine to > > introduce occasional whitespace errors, but they're unusually-plentiful > > here. > > I think its done now. One problem I have with running pgindent is that I > accidentally add chunks that were modified only by pgindent. Yep; that is a pain. Note that pgindent can't fix many unrelated whitespace changes. For example, if you add or remove a blank line, pgindent won't interfere. (We would not want it to interfere, because the use of blank lines is up to the code author.) You will still need to read through your diff for such things. On Wed, Jan 29, 2014 at 12:44:16PM +0100, Ronan Dunklau wrote: > Le mercredi 29 janvier 2014 09:13:36 Kouhei Kaigai a écrit : > > It may make sense to put a check fdw_nextwrite is less than INT_MAX. :-) > > The attached patch checks this, and add documentation for this limitation. > I'm not really sure about how to phrase that correctly in the error message > and the documentation. One can store at most INT_MAX foreign tuples, which > means that at most INT_MAX insert or delete or "half-updates" can occur. By > half-updates, I mean that for update two tuples are stored whereas in contrast > to only one for insert and delete trigger. > > Besides, I don't know where this disclaimer should be in the documentation. > Any advice here ? I wouldn't mention that limitation. > @@ -3390,34 +3497,72 @@ AfterTriggerExecute(AfterTriggerEvent event, > /* > * Fetch the required tuple(s). > */ > - if (ItemPointerIsValid(&(event->ate_ctid1))) > + if (event->ate_flags & AFTER_TRIGGER_FDW) > { > - ItemPointerCopy(&(event->ate_ctid1), &(tuple1.t_self)); > - if (!heap_fetch(rel, SnapshotAny, &tuple1, &buffer1, false, NULL)) > + AfterTriggerEventDataFDW *fdwevent = (AfterTriggerEventDataFDW *) event; > + > + if (afterTriggers->fdw_lastread > fdwevent->ate_ptr1) > + { > + tuplestore_rescan(afterTriggers->fdwtuplestore); > + afterTriggers->fdw_lastread = 0; > + } > + while (afterTriggers->fdw_lastread < fdwevent->ate_ptr1) > + { > + tuplestore_advance(afterTriggers->fdwtuplestore, true); > + afterTriggers->fdw_lastread++; > + } This is the performance trap I mentioned above; it brings potential O(n^2) complexity to certain AFTER trigger execution scenarios. > @@ -3709,6 +3862,9 @@ AfterTriggerBeginXact(void) > afterTriggers->depth_stack = NULL; > afterTriggers->firing_stack = NULL; > afterTriggers->maxtransdepth = 0; > + afterTriggers->fdwtuplestore = tuplestore_begin_heap(true, false, work_mem); Probably best to create the tuplestore lazily, similar to how we initialize afterTriggers->event_cxt. tuplestore_begin_heap() is almost cheap enough to call unconditionally, but relatively few queries will use this. > @@ -986,7 +986,18 @@ ExecModifyTable(ModifyTableState *node) > } > else if (relkind == RELKIND_FOREIGN_TABLE) > { > - /* do nothing; FDW must fetch any junk attrs it wants */ > + /* > + * If the junkAttNo is valid, then it identifies the > + * wholerow attribute. This is the case when there is an > + * UPDATE or DELETE trigger. > + */ > + if (AttributeNumberIsValid(junkfilter->jf_junkAttNo)) > + { > + datum = ExecGetJunkAttribute(slot, > + junkfilter->jf_junkAttNo, > + &isNull); > + oldtuple = DatumGetHeapTupleHeader(datum); Check "isNull", just in case. See similar code elsewhere in this file. > + } > } > else > { > @@ -1334,7 +1345,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) > } > else if (relkind == RELKIND_FOREIGN_TABLE) > { > - /* FDW must fetch any junk attrs it wants */ > + /* > + * FDW must fetch any junk attrs it wants When there > + * is an AFTER trigger, there should be a wholerow > + * attribute. This comment edit looks half-done. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Thank you for this new review, please find attached a revised patch. Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit : > On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote: > > What do you think about this approach ? Is there something I missed which > > would make it not sustainable ? > > Seems basically reasonable. I foresee multiple advantages from having one > tuplestore per query level as opposed to one for the entire transaction. > You would remove the performance trap of backing up the tuplestore by > rescanning. It permits reclaiming memory and disk space in > AfterTriggerEndQuery() rather than at end of transaction. You could remove > ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the > flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next one or > the next two tuples from the tuplestore. Using work_mem per > AfterTriggerBeginQuery() instead of per transaction is no problem. What do > you think of that design change? > I agree that this design is better, but I have some objections. We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but the rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch) can't go away. Consider for example the case of a foreign table with more than one AFTER UPDATE triggers. Unless we store the tuples once for each trigger, we will have to rescan the tuplestore. To mitigate the effects of this behaviour, I added the option to perform a reverse_seek when the looked-up tuple is nearer from the current index than from the start. > Probably best to create the tuplestore lazily, similar to how we initialize > afterTriggers->event_cxt. tuplestore_begin_heap() is almost cheap enough to > call unconditionally, but relatively few queries will use this. The per-query tuplestore design stores one pointer per FDWTuplestore struct. This struct is only used to keep track of the current READ and WRITE pointer positions along with the tuplestore itself. The actual structure will be initialized only if the query needs it. > If you do pursue that change, make sure the code still does the right thing > when it drops queued entries during subxact abort. I don't really understand what should be done at that stage. Since triggers on foreign tables are not allowed to be deferred, everything should be cleaned up at the end of each query, right ? So, there shouldn't be any queued entries. > > > I do not have access to the standard specification, any advice regarding > > specs compliance would be welcomed. > > http://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_ > SQL_standards.3F Thank you. I did not find anything more than what you noted. I think that even if the bit about foreign tables being mentioned from a column list, and the fact that a DROP TABLE should not create a new trigger execution context are confusing, the fact that the CREATE TRIGGER definition explicitly mentions either base tables or views makes me think that foreign tables are not considered. > You're effectively faking the presence of a RETURNING list so today's > conforming FDWs will already do the right thing? Clever. Yes, that's it. I hope I didn't introduce any side effects with regards to the meaning of the hasReturning flag. > Note that pgindent can't fix many unrelated whitespace changes. For > example, if you add or remove a blank line, pgindent won't interfere. (We > would not want it to interfere, because the use of blank lines is up to the > code author.) You will still need to read through your diff for such > things. Ok. > > The attached patch checks this, and add documentation for this limitation. > > I'm not really sure about how to phrase that correctly in the error > > message > > and the documentation. One can store at most INT_MAX foreign tuples, which > > means that at most INT_MAX insert or delete or "half-updates" can occur. > > By > > half-updates, I mean that for update two tuples are stored whereas in > > contrast to only one for insert and delete trigger. > > > > Besides, I don't know where this disclaimer should be in the > > documentation. > > Any advice here ? > > I wouldn't mention that limitation. Maybe it shouldn't be so prominent, but I still think a note somewhere couldn't hurt. Should the use of work_mem be documented somewhere, too ? > This is the performance trap I mentioned above; it brings potential O(n^2) > complexity to certain AFTER trigger execution scenarios. What scenarios do you have in mind ? I "fixed" the problem when there are multiple triggers on a foreign table, do you have any other one ? > > Check "isNull", just in case. See similar code elsewhere in this file. Ok. > This comment edit looks half-done. Ok. -- Ronan Dunklau http://dalibo.com - http://dalibo.org
Attachment
On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote: > Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit : > > On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote: > > > What do you think about this approach ? Is there something I missed which > > > would make it not sustainable ? > > > > Seems basically reasonable. I foresee multiple advantages from having one > > tuplestore per query level as opposed to one for the entire transaction. > > You would remove the performance trap of backing up the tuplestore by > > rescanning. It permits reclaiming memory and disk space in > > AfterTriggerEndQuery() rather than at end of transaction. You could remove > > ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the > > flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next one or > > the next two tuples from the tuplestore. Using work_mem per > > AfterTriggerBeginQuery() instead of per transaction is no problem. What do > > you think of that design change? > > > > I agree that this design is better, but I have some objections. > > We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but the > rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch) can't > go away. > > Consider for example the case of a foreign table with more than one AFTER > UPDATE triggers. Unless we store the tuples once for each trigger, we will > have to rescan the tuplestore. Will we? Within a given query level, when do (non-deferred) triggers execute in an order other than the enqueue order? > To mitigate the effects of this behaviour, I added the option to perform a > reverse_seek when the looked-up tuple is nearer from the current index than > from the start. If there's still a need to seek within the tuplestore, that should get rid of the O(n^2) effect. I'm hoping that per-query-level tuplestores will eliminate the need to seek entirely. > > If you do pursue that change, make sure the code still does the right thing > > when it drops queued entries during subxact abort. > > I don't really understand what should be done at that stage. Since triggers on > foreign tables are not allowed to be deferred, everything should be cleaned up > at the end of each query, right ? So, there shouldn't be any queued entries. I suspect that's right. If you haven't looked over AfterTriggerEndSubXact(), please do so and ensure all its actions still make sense in the context of this new kind of trigger storage. > > > The attached patch checks this, and add documentation for this limitation. > > > I'm not really sure about how to phrase that correctly in the error > > > message > > > and the documentation. One can store at most INT_MAX foreign tuples, which > > > means that at most INT_MAX insert or delete or "half-updates" can occur. > > > By > > > half-updates, I mean that for update two tuples are stored whereas in > > > contrast to only one for insert and delete trigger. > > > > > > Besides, I don't know where this disclaimer should be in the > > > documentation. > > > Any advice here ? > > > > I wouldn't mention that limitation. > > Maybe it shouldn't be so prominent, but I still think a note somewhere > couldn't hurt. Perhaps. There's not much documentation of such implementation upper limits, and there's no usage of "INT_MAX" outside of parts that discuss writing C code. I'm not much of a visionary when it comes to the documentation; I try to document new things with an amount of detail similar to old features. > Should the use of work_mem be documented somewhere, too ? I wouldn't. Most uses of work_mem are undocumented, even relatively major ones like count(DISTINCT ...) and CTEs. So, while I'd generally favor a patch documenting all/most of the things that use work_mem, it would be odd to document one new consumer apart from the others. > > This is the performance trap I mentioned above; it brings potential O(n^2) > > complexity to certain AFTER trigger execution scenarios. > > What scenarios do you have in mind ? I "fixed" the problem when there are > multiple triggers on a foreign table, do you have any other one ? I'm not aware of such a performance trap in your latest design. For what it's worth, I don't think multiple triggers were ever a problem. In the previous design, a problem arose if you had a scenario like this: Query level 1: queue one million events ... Repeat this section many times: Query level 2: queue one event Query level 3: queue one event Query level 3: execute events Query level 2: execute events <-- had to advance through the 1M stored events ... Query level 1: execute events Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Le lundi 3 février 2014 23:28:45 Noah Misch a écrit : > On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote: > > Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit : > > > On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote: > > > > What do you think about this approach ? Is there something I missed > > > > which > > > > would make it not sustainable ? > > > > > > Seems basically reasonable. I foresee multiple advantages from having > > > one > > > tuplestore per query level as opposed to one for the entire transaction. > > > You would remove the performance trap of backing up the tuplestore by > > > rescanning. It permits reclaiming memory and disk space in > > > AfterTriggerEndQuery() rather than at end of transaction. You could > > > remove > > > ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the > > > flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next one > > > or > > > the next two tuples from the tuplestore. Using work_mem per > > > AfterTriggerBeginQuery() instead of per transaction is no problem. What > > > do > > > you think of that design change? > > > > I agree that this design is better, but I have some objections. > > > > We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but the > > rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch) > > can't go away. > > > > Consider for example the case of a foreign table with more than one AFTER > > UPDATE triggers. Unless we store the tuples once for each trigger, we will > > have to rescan the tuplestore. > > Will we? Within a given query level, when do (non-deferred) triggers > execute in an order other than the enqueue order? Let me explain what I had in mind. Looking at the code in AfterTriggerSaveEvent: - we build a "template" AfterTriggerEvent, and store the tuple(s) - for each suitable after trigger that matches the trigger type, as well as the WHEN condition if any, a copy of the previously built AfterTriggerEvent is queued Later, those events are fired in order. This means that more than one event can be fired for one tuple. Take this example: CREATE TRIGGER trig_row_after1 AFTER UPDATE ON rem2 FOR EACH ROW WHEN (NEW.f1 % 5 < 3) EXECUTE PROCEDURE trigger_func('TRIG1'); CREATE TRIGGER trig_row_after2 AFTER UPDATE ON rem2 FOR EACH ROW WHEN (NEW.f1 % 5 < 4) EXECUTE PROCEDURE trigger_func('TRIG2'); UPDATE rem2 set f2 = 'something'; Assuming 5 rows with f1 as a serial, the fired AfterTriggerEvent's ate_tupleindex will be, in that order. Ass 0-0-2-2-4-8-8 So, at least a backward seek is required for trig_row_after2 to be able to retrieve a tuple that was already consumed when firing trig_row_after1. On a side note, this made me realize that it is better to avoid storing a tuple entirely if there is no enabled trigger (the f1 = 4 case above). The attached patch does that, so the previous sequence becomes: 0-0-2-2-4-6-6 It also prevents from initalizing a tuplestore at all if its not needed. > > > To mitigate the effects of this behaviour, I added the option to perform a > > reverse_seek when the looked-up tuple is nearer from the current index > > than > > from the start. > > If there's still a need to seek within the tuplestore, that should get rid > of the O(n^2) effect. I'm hoping that per-query-level tuplestores will > eliminate the need to seek entirely. I think the only case when seeking is still needed is when there are more than one after trigger that need to be fired, since the abovementioned change prevents from seeking to skip tuples. > > > > If you do pursue that change, make sure the code still does the right > > > thing > > > when it drops queued entries during subxact abort. > > > > I don't really understand what should be done at that stage. Since > > triggers on foreign tables are not allowed to be deferred, everything > > should be cleaned up at the end of each query, right ? So, there > > shouldn't be any queued entries. > I suspect that's right. If you haven't looked over > AfterTriggerEndSubXact(), please do so and ensure all its actions still > make sense in the context of this new kind of trigger storage. You're right, I missed something here. When aborting a subxact, the tuplestores for queries below the subxact query depth should be cleaned, if any, because AfterTriggerEndQuery has not been called for the failing query. The attached patch fixes that. > > > > > The attached patch checks this, and add documentation for this > > > > limitation. > > > > I'm not really sure about how to phrase that correctly in the error > > > > message > > > > and the documentation. One can store at most INT_MAX foreign tuples, > > > > which > > > > means that at most INT_MAX insert or delete or "half-updates" can > > > > occur. > > > > By > > > > half-updates, I mean that for update two tuples are stored whereas in > > > > contrast to only one for insert and delete trigger. > > > > > > > > Besides, I don't know where this disclaimer should be in the > > > > documentation. > > > > Any advice here ? > > > > > > I wouldn't mention that limitation. > > > > Maybe it shouldn't be so prominent, but I still think a note somewhere > > couldn't hurt. > > Perhaps. There's not much documentation of such implementation upper > limits, and there's no usage of "INT_MAX" outside of parts that discuss > writing C code. I'm not much of a visionary when it comes to the > documentation; I try to document new things with an amount of detail > similar to old features. Ok, I removed the paragraph documenting the limitation. > > Should the use of work_mem be documented somewhere, too ? > > I wouldn't. Most uses of work_mem are undocumented, even relatively major > ones like count(DISTINCT ...) and CTEs. So, while I'd generally favor a > patch documenting all/most of the things that use work_mem, it would be odd > to document one new consumer apart from the others. Ok. > > > > This is the performance trap I mentioned above; it brings potential > > > O(n^2) > > > complexity to certain AFTER trigger execution scenarios. > > > > What scenarios do you have in mind ? I "fixed" the problem when there are > > multiple triggers on a foreign table, do you have any other one ? > > I'm not aware of such a performance trap in your latest design. Good ! > Thanks, > nm -- Ronan Dunklau http://dalibo.com - http://dalibo.org
Attachment
Hello. Did you have time to review the latest version of this patch ? Is there anything I can do to get this "ready for commiter" ? Thank you for all the work performed so far. Le mardi 4 février 2014 13:16:22 Ronan Dunklau a écrit : > Le lundi 3 février 2014 23:28:45 Noah Misch a écrit : > > On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote: > > > Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit : > > > > On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote: > > > > > What do you think about this approach ? Is there something I missed > > > > > which > > > > > would make it not sustainable ? > > > > > > > > Seems basically reasonable. I foresee multiple advantages from having > > > > one > > > > tuplestore per query level as opposed to one for the entire > > > > transaction. > > > > You would remove the performance trap of backing up the tuplestore by > > > > rescanning. It permits reclaiming memory and disk space in > > > > AfterTriggerEndQuery() rather than at end of transaction. You could > > > > remove > > > > ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the > > > > flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next > > > > one > > > > or > > > > the next two tuples from the tuplestore. Using work_mem per > > > > AfterTriggerBeginQuery() instead of per transaction is no problem. > > > > What > > > > do > > > > you think of that design change? > > > > > > I agree that this design is better, but I have some objections. > > > > > > We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but > > > the > > > rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch) > > > can't go away. > > > > > > Consider for example the case of a foreign table with more than one > > > AFTER > > > UPDATE triggers. Unless we store the tuples once for each trigger, we > > > will > > > have to rescan the tuplestore. > > > > Will we? Within a given query level, when do (non-deferred) triggers > > execute in an order other than the enqueue order? > > Let me explain what I had in mind. > > Looking at the code in AfterTriggerSaveEvent: > > - we build a "template" AfterTriggerEvent, and store the tuple(s) > - for each suitable after trigger that matches the trigger type, as well as > the WHEN condition if any, a copy of the previously built AfterTriggerEvent > is queued > > Later, those events are fired in order. > > This means that more than one event can be fired for one tuple. > > Take this example: > > CREATE TRIGGER trig_row_after1 > AFTER UPDATE ON rem2 > FOR EACH ROW > WHEN (NEW.f1 % 5 < 3) > EXECUTE PROCEDURE trigger_func('TRIG1'); > > CREATE TRIGGER trig_row_after2 > AFTER UPDATE ON rem2 > FOR EACH ROW > WHEN (NEW.f1 % 5 < 4) > EXECUTE PROCEDURE trigger_func('TRIG2'); > > UPDATE rem2 set f2 = 'something'; > > Assuming 5 rows with f1 as a serial, the fired AfterTriggerEvent's > ate_tupleindex will be, in that order. Ass > > 0-0-2-2-4-8-8 > > So, at least a backward seek is required for trig_row_after2 to be able to > retrieve a tuple that was already consumed when firing trig_row_after1. > > On a side note, this made me realize that it is better to avoid storing a > tuple entirely if there is no enabled trigger (the f1 = 4 case above). The > attached patch does that, so the previous sequence becomes: > > 0-0-2-2-4-6-6 > > It also prevents from initalizing a tuplestore at all if its not needed. > > > > To mitigate the effects of this behaviour, I added the option to perform > > > a > > > reverse_seek when the looked-up tuple is nearer from the current index > > > than > > > from the start. > > > > If there's still a need to seek within the tuplestore, that should get rid > > of the O(n^2) effect. I'm hoping that per-query-level tuplestores will > > eliminate the need to seek entirely. > > I think the only case when seeking is still needed is when there are more > than one after trigger that need to be fired, since the abovementioned > change prevents from seeking to skip tuples. > > > > > If you do pursue that change, make sure the code still does the right > > > > thing > > > > when it drops queued entries during subxact abort. > > > > > > I don't really understand what should be done at that stage. Since > > > triggers on foreign tables are not allowed to be deferred, everything > > > should be cleaned up at the end of each query, right ? So, there > > > shouldn't be any queued entries. > > > > I suspect that's right. If you haven't looked over > > AfterTriggerEndSubXact(), please do so and ensure all its actions still > > make sense in the context of this new kind of trigger storage. > > You're right, I missed something here. When aborting a subxact, the > tuplestores for queries below the subxact query depth should be cleaned, if > any, because AfterTriggerEndQuery has not been called for the failing query. > > The attached patch fixes that. > > > > > > The attached patch checks this, and add documentation for this > > > > > limitation. > > > > > I'm not really sure about how to phrase that correctly in the error > > > > > message > > > > > and the documentation. One can store at most INT_MAX foreign tuples, > > > > > which > > > > > means that at most INT_MAX insert or delete or "half-updates" can > > > > > occur. > > > > > By > > > > > half-updates, I mean that for update two tuples are stored whereas > > > > > in > > > > > contrast to only one for insert and delete trigger. > > > > > > > > > > Besides, I don't know where this disclaimer should be in the > > > > > documentation. > > > > > Any advice here ? > > > > > > > > I wouldn't mention that limitation. > > > > > > Maybe it shouldn't be so prominent, but I still think a note somewhere > > > couldn't hurt. > > > > Perhaps. There's not much documentation of such implementation upper > > limits, and there's no usage of "INT_MAX" outside of parts that discuss > > writing C code. I'm not much of a visionary when it comes to the > > documentation; I try to document new things with an amount of detail > > similar to old features. > > Ok, I removed the paragraph documenting the limitation. > > > > Should the use of work_mem be documented somewhere, too ? > > > > I wouldn't. Most uses of work_mem are undocumented, even relatively major > > ones like count(DISTINCT ...) and CTEs. So, while I'd generally favor a > > patch documenting all/most of the things that use work_mem, it would be > > odd > > to document one new consumer apart from the others. > > Ok. > > > > > This is the performance trap I mentioned above; it brings potential > > > > O(n^2) > > > > complexity to certain AFTER trigger execution scenarios. > > > > > > What scenarios do you have in mind ? I "fixed" the problem when there > > > are > > > multiple triggers on a foreign table, do you have any other one ? > > > > I'm not aware of such a performance trap in your latest design. > > Good ! > > > Thanks, > > nm -- Ronan Dunklau http://dalibo.com - http://dalibo.org
I tried to check the latest (v8) patch again, then could not find problem in your design change from the v7. As Noah pointed out, it uses per query-depth tuplestore being released on AfterTriggerEndSubXact. So, may I mark it as "ready for committer"? 2014-03-03 15:48 GMT+09:00 Ronan Dunklau <ronan.dunklau@dalibo.com>: > Hello. > > Did you have time to review the latest version of this patch ? Is there > anything I can do to get this "ready for commiter" ? > > Thank you for all the work performed so far. > > > > > Le mardi 4 février 2014 13:16:22 Ronan Dunklau a écrit : >> Le lundi 3 février 2014 23:28:45 Noah Misch a écrit : >> > On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote: >> > > Le jeudi 30 janvier 2014 14:05:08 Noah Misch a écrit : >> > > > On Thu, Jan 23, 2014 at 03:17:35PM +0100, Ronan Dunklau wrote: >> > > > > What do you think about this approach ? Is there something I missed >> > > > > which >> > > > > would make it not sustainable ? >> > > > >> > > > Seems basically reasonable. I foresee multiple advantages from having >> > > > one >> > > > tuplestore per query level as opposed to one for the entire >> > > > transaction. >> > > > You would remove the performance trap of backing up the tuplestore by >> > > > rescanning. It permits reclaiming memory and disk space in >> > > > AfterTriggerEndQuery() rather than at end of transaction. You could >> > > > remove >> > > > ate_ptr1 and ate_ptr2 from AfterTriggerEventDataFDW and just store the >> > > > flags word: depending on AFTER_TRIGGER_2CTIDS, grab either the next >> > > > one >> > > > or >> > > > the next two tuples from the tuplestore. Using work_mem per >> > > > AfterTriggerBeginQuery() instead of per transaction is no problem. >> > > > What >> > > > do >> > > > you think of that design change? >> > > >> > > I agree that this design is better, but I have some objections. >> > > >> > > We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but >> > > the >> > > rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch) >> > > can't go away. >> > > >> > > Consider for example the case of a foreign table with more than one >> > > AFTER >> > > UPDATE triggers. Unless we store the tuples once for each trigger, we >> > > will >> > > have to rescan the tuplestore. >> > >> > Will we? Within a given query level, when do (non-deferred) triggers >> > execute in an order other than the enqueue order? >> >> Let me explain what I had in mind. >> >> Looking at the code in AfterTriggerSaveEvent: >> >> - we build a "template" AfterTriggerEvent, and store the tuple(s) >> - for each suitable after trigger that matches the trigger type, as well as >> the WHEN condition if any, a copy of the previously built AfterTriggerEvent >> is queued >> >> Later, those events are fired in order. >> >> This means that more than one event can be fired for one tuple. >> >> Take this example: >> >> CREATE TRIGGER trig_row_after1 >> AFTER UPDATE ON rem2 >> FOR EACH ROW >> WHEN (NEW.f1 % 5 < 3) >> EXECUTE PROCEDURE trigger_func('TRIG1'); >> >> CREATE TRIGGER trig_row_after2 >> AFTER UPDATE ON rem2 >> FOR EACH ROW >> WHEN (NEW.f1 % 5 < 4) >> EXECUTE PROCEDURE trigger_func('TRIG2'); >> >> UPDATE rem2 set f2 = 'something'; >> >> Assuming 5 rows with f1 as a serial, the fired AfterTriggerEvent's >> ate_tupleindex will be, in that order. Ass >> >> 0-0-2-2-4-8-8 >> >> So, at least a backward seek is required for trig_row_after2 to be able to >> retrieve a tuple that was already consumed when firing trig_row_after1. >> >> On a side note, this made me realize that it is better to avoid storing a >> tuple entirely if there is no enabled trigger (the f1 = 4 case above). The >> attached patch does that, so the previous sequence becomes: >> >> 0-0-2-2-4-6-6 >> >> It also prevents from initalizing a tuplestore at all if its not needed. >> >> > > To mitigate the effects of this behaviour, I added the option to perform >> > > a >> > > reverse_seek when the looked-up tuple is nearer from the current index >> > > than >> > > from the start. >> > >> > If there's still a need to seek within the tuplestore, that should get rid >> > of the O(n^2) effect. I'm hoping that per-query-level tuplestores will >> > eliminate the need to seek entirely. >> >> I think the only case when seeking is still needed is when there are more >> than one after trigger that need to be fired, since the abovementioned >> change prevents from seeking to skip tuples. >> >> > > > If you do pursue that change, make sure the code still does the right >> > > > thing >> > > > when it drops queued entries during subxact abort. >> > > >> > > I don't really understand what should be done at that stage. Since >> > > triggers on foreign tables are not allowed to be deferred, everything >> > > should be cleaned up at the end of each query, right ? So, there >> > > shouldn't be any queued entries. >> > >> > I suspect that's right. If you haven't looked over >> > AfterTriggerEndSubXact(), please do so and ensure all its actions still >> > make sense in the context of this new kind of trigger storage. >> >> You're right, I missed something here. When aborting a subxact, the >> tuplestores for queries below the subxact query depth should be cleaned, if >> any, because AfterTriggerEndQuery has not been called for the failing query. >> >> The attached patch fixes that. >> >> > > > > The attached patch checks this, and add documentation for this >> > > > > limitation. >> > > > > I'm not really sure about how to phrase that correctly in the error >> > > > > message >> > > > > and the documentation. One can store at most INT_MAX foreign tuples, >> > > > > which >> > > > > means that at most INT_MAX insert or delete or "half-updates" can >> > > > > occur. >> > > > > By >> > > > > half-updates, I mean that for update two tuples are stored whereas >> > > > > in >> > > > > contrast to only one for insert and delete trigger. >> > > > > >> > > > > Besides, I don't know where this disclaimer should be in the >> > > > > documentation. >> > > > > Any advice here ? >> > > > >> > > > I wouldn't mention that limitation. >> > > >> > > Maybe it shouldn't be so prominent, but I still think a note somewhere >> > > couldn't hurt. >> > >> > Perhaps. There's not much documentation of such implementation upper >> > limits, and there's no usage of "INT_MAX" outside of parts that discuss >> > writing C code. I'm not much of a visionary when it comes to the >> > documentation; I try to document new things with an amount of detail >> > similar to old features. >> >> Ok, I removed the paragraph documenting the limitation. >> >> > > Should the use of work_mem be documented somewhere, too ? >> > >> > I wouldn't. Most uses of work_mem are undocumented, even relatively major >> > ones like count(DISTINCT ...) and CTEs. So, while I'd generally favor a >> > patch documenting all/most of the things that use work_mem, it would be >> > odd >> > to document one new consumer apart from the others. >> >> Ok. >> >> > > > This is the performance trap I mentioned above; it brings potential >> > > > O(n^2) >> > > > complexity to certain AFTER trigger execution scenarios. >> > > >> > > What scenarios do you have in mind ? I "fixed" the problem when there >> > > are >> > > multiple triggers on a foreign table, do you have any other one ? >> > >> > I'm not aware of such a performance trap in your latest design. >> >> Good ! >> >> > Thanks, >> > nm > > -- > Ronan Dunklau > http://dalibo.com - http://dalibo.org -- KaiGai Kohei <kaigai@kaigai.gr.jp>
On Mon, Mar 03, 2014 at 11:10:30PM +0900, Kohei KaiGai wrote: > I tried to check the latest (v8) patch again, then could not find > problem in your design change from the v7. > As Noah pointed out, it uses per query-depth tuplestore being released > on AfterTriggerEndSubXact. > > So, may I mark it as "ready for committer"? Yes. Re-reviewing this is next on my list. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
This version looks basically good. I have some cosmetic things to sweep up before commit. One point is a bit more substantial: On Tue, Feb 04, 2014 at 01:16:22PM +0100, Ronan Dunklau wrote: > Le lundi 3 février 2014 23:28:45 Noah Misch a écrit : > > On Sun, Feb 02, 2014 at 11:53:51AM +0100, Ronan Dunklau wrote: > > > We can remove ate_ptr2 and rely on the AFTER_TRIGGER_2CTIDS flag, but the > > > rescanning and ate_ptr1 (renamed ate_tupleindex in the attached patch) > > > can't go away. > > > > > > Consider for example the case of a foreign table with more than one AFTER > > > UPDATE triggers. Unless we store the tuples once for each trigger, we will > > > have to rescan the tuplestore. > > > > Will we? Within a given query level, when do (non-deferred) triggers > > execute in an order other than the enqueue order? > > Let me explain what I had in mind. > > Looking at the code in AfterTriggerSaveEvent: > > - we build a "template" AfterTriggerEvent, and store the tuple(s) > - for each suitable after trigger that matches the trigger type, as well as > the WHEN condition if any, a copy of the previously built AfterTriggerEvent is > queued > > Later, those events are fired in order. > > This means that more than one event can be fired for one tuple. > > Take this example: [snip] Thanks; that illuminated the facts I was missing. > On a side note, this made me realize that it is better to avoid storing a > tuple entirely if there is no enabled trigger (the f1 = 4 case above). The > attached patch does that, so the previous sequence becomes: > > 0-0-2-2-4-6-6 > > It also prevents from initalizing a tuplestore at all if its not needed. That's a sensible improvement. > > > To mitigate the effects of this behaviour, I added the option to perform a > > > reverse_seek when the looked-up tuple is nearer from the current index > > > than > > > from the start. > > > > If there's still a need to seek within the tuplestore, that should get rid > > of the O(n^2) effect. I'm hoping that per-query-level tuplestores will > > eliminate the need to seek entirely. > > I think the only case when seeking is still needed is when there are more than > one after trigger that need to be fired, since the abovementioned change > prevents from seeking to skip tuples. Agreed. More specifically, I see only two scenarios for retrieving tuples from the tuplestore. Scenario one is that we need the next tuple (or pair of tuples, depending on the TriggerEvent). Scenario two is that we need the tuple(s) most recently retrieved. If that's correct, I'm inclined to rearrange afterTriggerInvokeEvents() and AfterTriggerExecute() to remember the tuple or pair of tuples most recently retrieved. They'll then never call tuplestore_advance() just to reposition. Do you see a problem with that? I was again somewhat tempted to remove ate_tupleindex, perhaps by defining the four flag bits this way: #define AFTER_TRIGGER_DONE 0x10000000 #define AFTER_TRIGGER_IN_PROGRESS 0x20000000 /* two bits describing the size of and tuple sources for this event */ #define AFTER_TRIGGER_TUP_BITS 0xC0000000 #define AFTER_TRIGGER_FDW_REUSE 0x00000000 #define AFTER_TRIGGER_FDW_FETCH 0x40000000 #define AFTER_TRIGGER_1CTID 0x80000000 #define AFTER_TRIGGER_2CTID 0xC0000000 AFTER_TRIGGER_FDW_FETCH and AFTER_TRIGGER_FDW_REUSE correspond to the aforementioned scenarios one and two, respectively. I think, though, I'll rate this as needless micro-optimization and not bother; opinions welcome. (The savings is four bytes per foreign table trigger event.) Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Le mercredi 5 mars 2014 22:36:44 Noah Misch a écrit : > Agreed. More specifically, I see only two scenarios for retrieving tuples > from the tuplestore. Scenario one is that we need the next tuple (or pair > of tuples, depending on the TriggerEvent). Scenario two is that we need > the tuple(s) most recently retrieved. If that's correct, I'm inclined to > rearrange afterTriggerInvokeEvents() and AfterTriggerExecute() to remember > the tuple or pair of tuples most recently retrieved. They'll then never > call tuplestore_advance() just to reposition. Do you see a problem with > that? I don't see any problem with that. I don't know how this would be implemented, but it would make sense to avoid those scans, as long as a fresh copy is passed to the trigger: modifications to a tuple performed in an after trigger should not be visible to the next one. > > I was again somewhat tempted to remove ate_tupleindex, perhaps by defining > the four flag bits this way: > > #define AFTER_TRIGGER_DONE 0x10000000 > #define AFTER_TRIGGER_IN_PROGRESS 0x20000000 > /* two bits describing the size of and tuple sources for this event */ > #define AFTER_TRIGGER_TUP_BITS 0xC0000000 > #define AFTER_TRIGGER_FDW_REUSE 0x00000000 > #define AFTER_TRIGGER_FDW_FETCH 0x40000000 > #define AFTER_TRIGGER_1CTID 0x80000000 > #define AFTER_TRIGGER_2CTID 0xC0000000 > > AFTER_TRIGGER_FDW_FETCH and AFTER_TRIGGER_FDW_REUSE correspond to the > aforementioned scenarios one and two, respectively. I think, though, I'll > rate this as needless micro-optimization and not bother; opinions welcome. > (The savings is four bytes per foreign table trigger event.) I was already happy with having a lower footprint for foreign table trigger events than for regular trigger events, but if we remove the need for seeking in the tuplestore entirely, it would make sense to get rid of the index. > > Thanks, > nm Thanks to you. -- Ronan Dunklau http://dalibo.com - http://dalibo.org
I hacked on this for awhile, but there remain two matters on which I'm uncertain about the right way forward. (1) To acquire the old tuple for UPDATE/DELETE operations, the patch closely parallels our handling for INSTEAD OF triggers on views. It adds a wholerow resjunk attribute, from which it constructs a HeapTuple before calling a trigger function. This loses the system columns, an irrelevant consideration for views but meaningful for foreign tables. postgres_fdw maintains the "ctid" system column (t_self), but triggers will always see an invalid t_self for the old tuple. One way to fix this is to pass around the old tuple data as ROW(ctid, oid, xmin, cmin, xmax, cmax, tableoid, wholerow). That's fairly close to sufficient, but it doesn't cover t_ctid. Frankly, I would like to find a principled excuse to not worry about making foreign table system columns accessible from triggers. Supporting system columns dramatically affects the mechanism, and what trigger is likely to care? Unfortunately, that argument seems too weak. Does anyone have a cleaner idea for keeping track of the system column values or a stronger argument for not bothering? (2) When a foreign table has an AFTER ROW trigger, the FDW's ExecForeign{Insert,Update,Delete} callbacks must return a slot covering all columns. Current FDW API documentation says things like this: The data in the returned slot is used only if the INSERT query has a RETURNING clause. Hence, the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause. Consistent with that, postgres_fdw inspects the returningLists of the ModifyTable node to decide which columns are necessary. This patch has rewriteTargetListIU() add a resjunk wholerow var to the returningList of any query that will fire an AFTER ROW trigger on a foreign table. That avoids the need to change the FDW API contract or any postgres_fdw code. I was pleased about that for a time, but on further review, I'm doubting that the benefit for writable FDWs justifies muddying the definition of returningList; until now, returningList has been free from resjunk TLEs. For example, callers of FetchStatementTargetList() may be unprepared to see an all-resjunk list, instead of NIL, for a data modification statement that returns nothing. If we do keep the patch's approach, I'm inclined to rename returningList. However, I more lean toward adding a separate flag to indicate the need to return a complete tuple regardless of the RETURNING list. The benefits of overloading returningList are all short-term benefits. We know that the FDW API is still converging, so changing it seems eventually-preferable to, and safer than, changing the name or meaning of returningList. Thoughts? On Thu, Mar 06, 2014 at 09:11:19AM +0100, Ronan Dunklau wrote: > Le mercredi 5 mars 2014 22:36:44 Noah Misch a écrit : > > Agreed. More specifically, I see only two scenarios for retrieving tuples > > from the tuplestore. Scenario one is that we need the next tuple (or pair > > of tuples, depending on the TriggerEvent). Scenario two is that we need > > the tuple(s) most recently retrieved. If that's correct, I'm inclined to > > rearrange afterTriggerInvokeEvents() and AfterTriggerExecute() to remember > > the tuple or pair of tuples most recently retrieved. They'll then never > > call tuplestore_advance() just to reposition. Do you see a problem with > > that? > > I don't see any problem with that. I don't know how this would be implemented, > but it would make sense to avoid those scans, as long as a fresh copy is > passed to the trigger: modifications to a tuple performed in an after trigger > should not be visible to the next one. Trigger functions are not permitted to modify tg_trigtuple or tg_newtuple; notice that, for non-foreign triggers, we pass shared_buffers-backed tuples in those fields. Therefore, no copy is necessary. > > I was again somewhat tempted to remove ate_tupleindex, perhaps by defining > > the four flag bits this way: > > > > #define AFTER_TRIGGER_DONE 0x10000000 > > #define AFTER_TRIGGER_IN_PROGRESS 0x20000000 > > /* two bits describing the size of and tuple sources for this event */ > > #define AFTER_TRIGGER_TUP_BITS 0xC0000000 > > #define AFTER_TRIGGER_FDW_REUSE 0x00000000 > > #define AFTER_TRIGGER_FDW_FETCH 0x40000000 > > #define AFTER_TRIGGER_1CTID 0x80000000 > > #define AFTER_TRIGGER_2CTID 0xC0000000 > > > > AFTER_TRIGGER_FDW_FETCH and AFTER_TRIGGER_FDW_REUSE correspond to the > > aforementioned scenarios one and two, respectively. I think, though, I'll > > rate this as needless micro-optimization and not bother; opinions welcome. > > (The savings is four bytes per foreign table trigger event.) > > I was already happy with having a lower footprint for foreign table trigger > events than for regular trigger events, but if we remove the need for seeking > in the tuplestore entirely, it would make sense to get rid of the index. I'm pleased with how this turned out. Besides the memory savings in question, this removed the INT_MAX limitation and simplified the code overall. I did not observe a notable runtime improvement, though that's unsurprising. Other notable changes in the attached revision: 1. UPDATE/DELETE row-level triggers on foreign tables and INSTEAD OF triggers on views have a similar requirement to generate a HeapTuple representing the old row. View triggers did so in nodeModifyTable.c, while foreign table triggers did so in trigger.c. Both were valid choices, but the code siting should not be relkind-dependent without good reason. I centralized this in ExecModifyTable(). 2. Made CREATE TRIGGER forbid INSTEAD OF and TRUNCATE triggers on foreign tables. The documentation already claimed they were unavailable. 3. Fixed pointer arithmetic in AfterTriggerBeginQuery()'s MemSet() call. 4. Modified GetCurrentFDWTuplestore() to allocate the tuplestore in TopTransactionContext. We explicitly put the events list there (specifically, in a documentation-only child of that context), so it seemed more consistent to do the same for the associated foreign tuples. I did not find any live bug from the previous coding, because CurrentMemoryContext always happened to be one that survived past AfterTriggerEndQuery(). 5. Updated comments and documentation that still reflected earlier versions of the patch, as well as older comments obsoleted by the patch. 6. Reverted cosmetic changes, like addition of braces and blank lines, to passages of code not otherwise changing. Please see: https://wiki.postgresql.org/wiki/Creating_Clean_Patches -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Attachment
> I hacked on this for awhile, but there remain two matters on which I'm > uncertain about the right way forward. > > (1) To acquire the old tuple for UPDATE/DELETE operations, the patch closely > parallels our handling for INSTEAD OF triggers on views. It adds a wholerow > resjunk attribute, from which it constructs a HeapTuple before calling a > trigger function. This loses the system columns, an irrelevant > consideration for views but meaningful for foreign tables. postgres_fdw > maintains the "ctid" system column (t_self), but triggers will always see > an invalid t_self for the old tuple. One way to fix this is to pass around > the old tuple data as ROW(ctid, oid, xmin, cmin, xmax, cmax, tableoid, > wholerow). That's fairly close to sufficient, but it doesn't cover t_ctid. > Frankly, I would like to find a principled excuse to not worry about making > foreign table system columns accessible from triggers. Supporting system > columns dramatically affects the mechanism, and what trigger is likely to > care? Unfortunately, that argument seems too weak. Does anyone have a > cleaner idea for keeping track of the system column values or a stronger > argument for not bothering? > Regarding to the first suggestion, I think, it is better not to care about system columns on foreign tables, because it fully depends on driver's implementation whether FDW fetches "ctid" from its data source, or not. Usually, system columns of foreign table, except for tableoid, are nonsense. Because of implementation reason, postgres_fdw fetches "ctid" of remote tables on UPDATE / DELETE, it is not a common nature for all FDW drivers. For example, we can assume an implementation that uses primary key of remote table to identify the record to be updated or deleted. In this case, local "ctid" does not have meaningful value. So, fundamentally, we cannot guarantee FDW driver returns meaningful "ctid" or other system columns. > (2) When a foreign table has an AFTER ROW trigger, the FDW's > ExecForeign{Insert,Update,Delete} callbacks must return a slot covering > all columns. Current FDW API documentation says things like this: > > The data in the returned slot is used only if the INSERT query has a > RETURNING clause. Hence, the FDW could choose to optimize away returning > some or all columns depending on the contents of the RETURNING clause. > > Consistent with that, postgres_fdw inspects the returningLists of the > ModifyTable node to decide which columns are necessary. This patch has > rewriteTargetListIU() add a resjunk wholerow var to the returningList of > any query that will fire an AFTER ROW trigger on a foreign table. That > avoids the need to change the FDW API contract or any postgres_fdw code. > I was pleased about that for a time, but on further review, I'm doubting > that the benefit for writable FDWs justifies muddying the definition of > returningList; until now, returningList has been free from resjunk TLEs. > For example, callers of > FetchStatementTargetList() may be unprepared to see an all-resjunk list, > instead of NIL, for a data modification statement that returns nothing. > > If we do keep the patch's approach, I'm inclined to rename returningList. > However, I more lean toward adding a separate flag to indicate the need > to return a complete tuple regardless of the RETURNING list. The benefits > of overloading returningList are all short-term benefits. We know that > the FDW API is still converging, so changing it seems eventually-preferable > to, and safer than, changing the name or meaning of returningList. > Thoughts? > > On Thu, Mar 06, 2014 at 09:11:19AM +0100, Ronan Dunklau wrote: > > Le mercredi 5 mars 2014 22:36:44 Noah Misch a écrit : > > > Agreed. More specifically, I see only two scenarios for retrieving > > > tuples from the tuplestore. Scenario one is that we need the next > > > tuple (or pair of tuples, depending on the TriggerEvent). Scenario > > > two is that we need the tuple(s) most recently retrieved. If that's > > > correct, I'm inclined to rearrange afterTriggerInvokeEvents() and > > > AfterTriggerExecute() to remember the tuple or pair of tuples most > > > recently retrieved. They'll then never call tuplestore_advance() > > > just to reposition. Do you see a problem with that? > > > > I don't see any problem with that. I don't know how this would be > > implemented, but it would make sense to avoid those scans, as long as > > a fresh copy is passed to the trigger: modifications to a tuple > > performed in an after trigger should not be visible to the next one. > > Trigger functions are not permitted to modify tg_trigtuple or tg_newtuple; > notice that, for non-foreign triggers, we pass shared_buffers-backed tuples > in those fields. Therefore, no copy is necessary. > > > > I was again somewhat tempted to remove ate_tupleindex, perhaps by > > > defining the four flag bits this way: > > > > > > #define AFTER_TRIGGER_DONE 0x10000000 > > > #define AFTER_TRIGGER_IN_PROGRESS 0x20000000 > > > /* two bits describing the size of and tuple sources for this event > */ > > > #define AFTER_TRIGGER_TUP_BITS 0xC0000000 > > > #define AFTER_TRIGGER_FDW_REUSE 0x00000000 > > > #define AFTER_TRIGGER_FDW_FETCH 0x40000000 > > > #define AFTER_TRIGGER_1CTID 0x80000000 > > > #define AFTER_TRIGGER_2CTID 0xC0000000 > > > > > > AFTER_TRIGGER_FDW_FETCH and AFTER_TRIGGER_FDW_REUSE correspond to > > > the aforementioned scenarios one and two, respectively. I think, > > > though, I'll rate this as needless micro-optimization and not bother; > opinions welcome. > > > (The savings is four bytes per foreign table trigger event.) > > > > I was already happy with having a lower footprint for foreign table > > trigger events than for regular trigger events, but if we remove the > > need for seeking in the tuplestore entirely, it would make sense to get > rid of the index. > > I'm pleased with how this turned out. Besides the memory savings in > question, this removed the INT_MAX limitation and simplified the code > overall. I did not observe a notable runtime improvement, though that's > unsurprising. > > > Other notable changes in the attached revision: > > 1. UPDATE/DELETE row-level triggers on foreign tables and INSTEAD OF > triggers on views have a similar requirement to generate a HeapTuple > representing the old row. View triggers did so in nodeModifyTable.c, while > foreign table triggers did so in trigger.c. Both were valid choices, but > the code siting should not be relkind-dependent without good reason. I > centralized this in ExecModifyTable(). > > 2. Made CREATE TRIGGER forbid INSTEAD OF and TRUNCATE triggers on foreign > tables. The documentation already claimed they were unavailable. > > 3. Fixed pointer arithmetic in AfterTriggerBeginQuery()'s MemSet() call. > > 4. Modified GetCurrentFDWTuplestore() to allocate the tuplestore in > TopTransactionContext. We explicitly put the events list there > (specifically, in a documentation-only child of that context), so it seemed > more consistent to do the same for the associated foreign tuples. I did > not find any live bug from the previous coding, because CurrentMemoryContext > always happened to be one that survived past AfterTriggerEndQuery(). > > 5. Updated comments and documentation that still reflected earlier versions > of the patch, as well as older comments obsoleted by the patch. > > 6. Reverted cosmetic changes, like addition of braces and blank lines, to > passages of code not otherwise changing. Please see: > https://wiki.postgresql.org/wiki/Creating_Clean_Patches > > -- > Noah Misch > EnterpriseDB > http://www.enterprisedb.com -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
Le mardi 18 mars 2014 03:54:19 Kouhei Kaigai a écrit : > > I hacked on this for awhile, but there remain two matters on which I'm > > uncertain about the right way forward. > > > > (1) To acquire the old tuple for UPDATE/DELETE operations, the patch > > closelyparallels our handling for INSTEAD OF triggers on views. It > > adds a wholerow resjunk attribute, from which it constructs a HeapTuple > > before calling a trigger function. This loses the system columns, an > > irrelevant > > consideration for views but meaningful for foreign tables. postgres_fdw > > maintains the "ctid" system column (t_self), but triggers will always see > > an invalid t_self for the old tuple. One way to fix this is to pass > > aroundthe old tuple data as ROW(ctid, oid, xmin, cmin, xmax, cmax, > > tableoid, wholerow). That's fairly close to sufficient, but it doesn't > > cover t_ctid. Frankly, I would like to find a principled excuse to not > > worry about making foreign table system columns accessible from triggers. > > Supporting system columns dramatically affects the mechanism, and what > > trigger is likely to care? Unfortunately, that argument seems too weak. > > Does anyone have a cleaner idea for keeping track of the system column > > values or a stronger argument for not bothering? > > > > Regarding to the first suggestion, > I think, it is better not to care about system columns on foreign tables, > because it fully depends on driver's implementation whether FDW fetches > "ctid" from its data source, or not. > Usually, system columns of foreign table, except for tableoid, are > nonsense.Because of implementation reason, postgres_fdw fetches "ctid" of > remote tables on UPDATE / DELETE, it is not a common nature for all FDW > drivers. For example, we can assume an implementation that uses primary key > of remote table to identify the record to be updated or deleted. In this > case, local "ctid" does not have meaningful value. > So, fundamentally, we cannot guarantee FDW driver returns meaningful "ctid" > or other system columns. > I agree, I think it is somewhat clunky to have postgres_fdw use a feature that is basically meaningless for other FDWs. Looking at some threads in this list, it confused many people. This is off-topic, but maybe we could devise an API allowing for local "system attributes" on foreign table. This would allow FDWs to carry attributes that weren't declared as part of the table definition. This could then be used for postgres_fdw ctid, as well as others foreign data wrappers equivalent of an implicit "tuple id". > > > (2) When a foreign table has an AFTER ROW trigger, the FDW's > > ExecForeign{Insert,Update,Delete} callbacks must return a slot covering > > all columns. Current FDW API documentation says things like this: > > > > > > The data in the returned slot is used only if the INSERT query has a > > RETURNING clause. Hence, the FDW could choose to optimize away > > returning > > some or all columns depending on the contents of the RETURNING clause. > > > > > > Consistent with that, postgres_fdw inspects the returningLists of the > > ModifyTable node to decide which columns are necessary. This patch has > > rewriteTargetListIU() add a resjunk wholerow var to the returningList of > > any query that will fire an AFTER ROW trigger on a foreign table. That > > avoids the need to change the FDW API contract or any postgres_fdw code. > > I was pleased about that for a time, but on further review, I'm doubting > > that the benefit for writable FDWs justifies muddying the definition of > > returningList; until now, returningList has been free from resjunk TLEs. > > For example, callers of > > FetchStatementTargetList() may be unprepared to see an all-resjunk list, > > instead of NIL, for a data modification statement that returns nothing. > > > > If we do keep the patch's approach, I'm inclined to rename returningList. > > However, I more lean toward adding a separate flag to indicate the need > > to return a complete tuple regardless of the RETURNING list. The > > benefits > > of overloading returningList are all short-term benefits. We know that > > the FDW API is still converging, so changing it seems > > eventually-preferableto, and safer than, changing the name or meaning > > of returningList. Thoughts? > > > > On Thu, Mar 06, 2014 at 09:11:19AM +0100, Ronan Dunklau wrote: > > > > > Le mercredi 5 mars 2014 22:36:44 Noah Misch a écrit : > > > > > > > Agreed. More specifically, I see only two scenarios for retrieving > > > > tuples from the tuplestore. Scenario one is that we need the next > > > > tuple (or pair of tuples, depending on the TriggerEvent). Scenario > > > > two is that we need the tuple(s) most recently retrieved. If that's > > > > correct, I'm inclined to rearrange afterTriggerInvokeEvents() and > > > > AfterTriggerExecute() to remember the tuple or pair of tuples most > > > > recently retrieved. They'll then never call tuplestore_advance() > > > > just to reposition. Do you see a problem with that? > > > > > > > > > > > > I don't see any problem with that. I don't know how this would be > > > implemented, but it would make sense to avoid those scans, as long as > > > a fresh copy is passed to the trigger: modifications to a tuple > > > performed in an after trigger should not be visible to the next one. > > > > > > Trigger functions are not permitted to modify tg_trigtuple or > > tg_newtuple; > > notice that, for non-foreign triggers, we pass shared_buffers-backed > > tuplesin those fields. Therefore, no copy is necessary. > > > > > > > > I was again somewhat tempted to remove ate_tupleindex, perhaps by > > > > defining the four flag bits this way: > > > > > > > > > > > > > > > > #define AFTER_TRIGGER_DONE 0x10000000 > > > > #define AFTER_TRIGGER_IN_PROGRESS 0x20000000 > > > > /* two bits describing the size of and tuple sources for this event > > > > */ > > > > > > #define AFTER_TRIGGER_TUP_BITS 0xC0000000 > > > > #define AFTER_TRIGGER_FDW_REUSE 0x00000000 > > > > #define AFTER_TRIGGER_FDW_FETCH 0x40000000 > > > > #define AFTER_TRIGGER_1CTID 0x80000000 > > > > #define AFTER_TRIGGER_2CTID 0xC0000000 > > > > > > > > > > > > > > > > AFTER_TRIGGER_FDW_FETCH and AFTER_TRIGGER_FDW_REUSE correspond to > > > > the aforementioned scenarios one and two, respectively. I think, > > > > though, I'll rate this as needless micro-optimization and not bother; > > > > opinions welcome. > > > > > > (The savings is four bytes per foreign table trigger event.) > > > > > > > > > > > > I was already happy with having a lower footprint for foreign table > > > trigger events than for regular trigger events, but if we remove the > > > need for seeking in the tuplestore entirely, it would make sense to get > > > > rid of the index. > > > > I'm pleased with how this turned out. Besides the memory savings in > > question, this removed the INT_MAX limitation and simplified the code > > overall. I did not observe a notable runtime improvement, though that's > > unsurprising. > > > > > > Other notable changes in the attached revision: > > > > 1. UPDATE/DELETE row-level triggers on foreign tables and INSTEAD OF > > triggers on views have a similar requirement to generate a HeapTuple > > representing the old row. View triggers did so in nodeModifyTable.c, > > whileforeign table triggers did so in trigger.c. Both were valid > > choices, but the code siting should not be relkind-dependent without good > > reason. I centralized this in ExecModifyTable(). > > > > 2. Made CREATE TRIGGER forbid INSTEAD OF and TRUNCATE triggers on foreign > > tables. The documentation already claimed they were unavailable. > > > > 3. Fixed pointer arithmetic in AfterTriggerBeginQuery()'s MemSet() call. > > > > 4. Modified GetCurrentFDWTuplestore() to allocate the tuplestore in > > TopTransactionContext. We explicitly put the events list there > > (specifically, in a documentation-only child of that context), so it > > seemedmore consistent to do the same for the associated foreign tuples. > > I did not find any live bug from the previous coding, because > > CurrentMemoryContext always happened to be one that survived past > > AfterTriggerEndQuery(). > > 5. Updated comments and documentation that still reflected earlier > > versionsof the patch, as well as older comments obsoleted by the patch. > > > > 6. Reverted cosmetic changes, like addition of braces and blank lines, to > > passages of code not otherwise changing. Please see: > > https://wiki.postgresql.org/wiki/Creating_Clean_Patches > > > > -- > > Noah Misch > > EnterpriseDB > > http://www.enterprisedb.com > > > -- > NEC OSS Promotion Center / PG-Strom Project > KaiGai Kohei <kaigai@ak.jp.nec.com> > -- Ronan Dunklau http://dalibo.com - http://dalibo.org
On Mon, Mar 17, 2014 at 11:54 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote: >> I hacked on this for awhile, but there remain two matters on which I'm >> uncertain about the right way forward. >> >> (1) To acquire the old tuple for UPDATE/DELETE operations, the patch closely >> parallels our handling for INSTEAD OF triggers on views. It adds a wholerow >> resjunk attribute, from which it constructs a HeapTuple before calling a >> trigger function. This loses the system columns, an irrelevant >> consideration for views but meaningful for foreign tables. postgres_fdw >> maintains the "ctid" system column (t_self), but triggers will always see >> an invalid t_self for the old tuple. One way to fix this is to pass around >> the old tuple data as ROW(ctid, oid, xmin, cmin, xmax, cmax, tableoid, >> wholerow). That's fairly close to sufficient, but it doesn't cover t_ctid. >> Frankly, I would like to find a principled excuse to not worry about making >> foreign table system columns accessible from triggers. Supporting system >> columns dramatically affects the mechanism, and what trigger is likely to >> care? Unfortunately, that argument seems too weak. Does anyone have a >> cleaner idea for keeping track of the system column values or a stronger >> argument for not bothering? >> > Regarding to the first suggestion, > I think, it is better not to care about system columns on foreign tables, > because it fully depends on driver's implementation whether FDW fetches > "ctid" from its data source, or not. > Usually, system columns of foreign table, except for tableoid, are nonsense. > Because of implementation reason, postgres_fdw fetches "ctid" of remote > tables on UPDATE / DELETE, it is not a common nature for all FDW drivers. > For example, we can assume an implementation that uses primary key of remote > table to identify the record to be updated or deleted. In this case, local > "ctid" does not have meaningful value. > So, fundamentally, we cannot guarantee FDW driver returns meaningful "ctid" > or other system columns. I'm not sure I particularly agree with this reasoning - after all, just because some people might not find a feature useful isn't a reason not to have it. On the other hand, I don't think it's a very useful feature, and I don't feel like we have to have it. Most system columns can't be updated or indexed, and none of them can be dropped or renamed, so it's not like they aren't second-class citizens to some degree already. By way of comparison, the first version of the index-only scan patch gave up when it saw an expression index, instead of making an effort to figure out whether a matching expression was present in the query. Somebody could have looked at that patch and said, oh, well, that's an ugly and unacceptable limitation, and we ought to reject the patch until it's fixed. Well, instead, Tom committed the patch, and we still have that limitation today, and it's still something we really ought to fix some day, but in the meantime we have the feature. Obviously, the line between "your patch is only part of a feature, please finish it and try again" and "your patch implements a nice self-contained feature and there are some more things that we could build on top of it later" is to some extent a matter of judgement; but for what it's worth, I can't get too excited about this particular limitation of this particular patch. I just don't think that very many people are going to miss the functionality in question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 18, 2014 at 09:31:06AM +0100, Ronan Dunklau wrote: > Le mardi 18 mars 2014 03:54:19 Kouhei Kaigai a écrit : > > > (1) To acquire the old tuple for UPDATE/DELETE operations, the patch > > > closely > parallels our handling for INSTEAD OF triggers on views. It > > > adds a wholerow resjunk attribute, from which it constructs a HeapTuple > > > before calling a trigger function. This loses the system columns, an > > > irrelevant > > > consideration for views but meaningful for foreign tables. postgres_fdw > > > maintains the "ctid" system column (t_self), but triggers will always see > > > an invalid t_self for the old tuple. One way to fix this is to pass > > > around > the old tuple data as ROW(ctid, oid, xmin, cmin, xmax, cmax, > > > tableoid, wholerow). That's fairly close to sufficient, but it doesn't > > > cover t_ctid. Frankly, I would like to find a principled excuse to not > > > worry about making foreign table system columns accessible from triggers. > > > Supporting system columns dramatically affects the mechanism, and what > > > trigger is likely to care? Unfortunately, that argument seems too weak. > > > Does anyone have a cleaner idea for keeping track of the system column > > > values or a stronger argument for not bothering? > > > > > > > Regarding to the first suggestion, > > I think, it is better not to care about system columns on foreign tables, > > because it fully depends on driver's implementation whether FDW fetches > > "ctid" from its data source, or not. > > Usually, system columns of foreign table, except for tableoid, are > > nonsense. > Because of implementation reason, postgres_fdw fetches "ctid" of > > remote tables on UPDATE / DELETE, it is not a common nature for all FDW > > drivers. For example, we can assume an implementation that uses primary key > > of remote table to identify the record to be updated or deleted. In this > > case, local "ctid" does not have meaningful value. > > So, fundamentally, we cannot guarantee FDW driver returns meaningful "ctid" > > or other system columns. > > > > I agree, I think it is somewhat clunky to have postgres_fdw use a feature that > is basically meaningless for other FDWs. Looking at some threads in this list, > it confused many people. My own reasoning for accepting omission of system columns is more along the lines of Robert's argument. Regardless, three folks voting to do so and none against suffices for me. I documented the system columns limitation, made the returningList change I mentioned, and committed the patch. > This is off-topic, but maybe we could devise an API allowing for local "system > attributes" on foreign table. This would allow FDWs to carry attributes that > weren't declared as part of the table definition. This could then be used for > postgres_fdw ctid, as well as others foreign data wrappers equivalent of an > implicit "tuple id". We could, but I discourage it. System columns are a legacy feature; I doubt we'd choose that design afresh today. On the off-chance that you need the value of a remote system column, you can already declare an ordinary foreign table column for it. I raised the issue because it's inconsistent for RETURNING to convey system columns while tg_trigtuple/tg_newtuple do not, not because acquiring system columns from foreign tables is notably useful. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Le dimanche 23 mars 2014 02:44:26 Noah Misch a écrit : > On Tue, Mar 18, 2014 at 09:31:06AM +0100, Ronan Dunklau wrote: > > Le mardi 18 mars 2014 03:54:19 Kouhei Kaigai a écrit : > > > > (1) To acquire the old tuple for UPDATE/DELETE operations, the patch > > > > closely > > > > parallels our handling for INSTEAD OF triggers on views. It > > > > > > adds a wholerow resjunk attribute, from which it constructs a > > > > HeapTuple > > > > before calling a trigger function. This loses the system columns, an > > > > irrelevant > > > > consideration for views but meaningful for foreign tables. > > > > postgres_fdw > > > > maintains the "ctid" system column (t_self), but triggers will always > > > > see > > > > an invalid t_self for the old tuple. One way to fix this is to pass > > > > around > > > > the old tuple data as ROW(ctid, oid, xmin, cmin, xmax, cmax, > > > > > > tableoid, wholerow). That's fairly close to sufficient, but it > > > > doesn't > > > > cover t_ctid. Frankly, I would like to find a principled excuse to not > > > > worry about making foreign table system columns accessible from > > > > triggers. > > > > > > > > Supporting system columns dramatically affects the mechanism, and > > > > what > > > > > > > > trigger is likely to care? Unfortunately, that argument seems too > > > > weak. > > > > Does anyone have a cleaner idea for keeping track of the system column > > > > values or a stronger argument for not bothering? > > > > > > Regarding to the first suggestion, > > > I think, it is better not to care about system columns on foreign > > > tables, > > > because it fully depends on driver's implementation whether FDW fetches > > > "ctid" from its data source, or not. > > > Usually, system columns of foreign table, except for tableoid, are > > > nonsense. > > > > Because of implementation reason, postgres_fdw fetches "ctid" of > > > > > remote tables on UPDATE / DELETE, it is not a common nature for all FDW > > > drivers. For example, we can assume an implementation that uses primary > > > key > > > of remote table to identify the record to be updated or deleted. In this > > > case, local "ctid" does not have meaningful value. > > > So, fundamentally, we cannot guarantee FDW driver returns meaningful > > > "ctid" > > > or other system columns. > > > > I agree, I think it is somewhat clunky to have postgres_fdw use a feature > > that is basically meaningless for other FDWs. Looking at some threads in > > this list, it confused many people. > > My own reasoning for accepting omission of system columns is more along the > lines of Robert's argument. Regardless, three folks voting to do so and > none against suffices for me. I documented the system columns limitation, > made the returningList change I mentioned, and committed the patch. Thank you, I'm glad the patch found its way to the repository ! > > > This is off-topic, but maybe we could devise an API allowing for local > > "system attributes" on foreign table. This would allow FDWs to carry > > attributes that weren't declared as part of the table definition. This > > could then be used for postgres_fdw ctid, as well as others foreign data > > wrappers equivalent of an implicit "tuple id". > > We could, but I discourage it. System columns are a legacy feature; I doubt > we'd choose that design afresh today. On the off-chance that you need the > value of a remote system column, you can already declare an ordinary > foreign table column for it. I raised the issue because it's inconsistent > for RETURNING to convey system columns while tg_trigtuple/tg_newtuple do > not, not because acquiring system columns from foreign tables is notably > useful. The idea here was to allow an FDW author to add columns which are not part of the table definition, for example colums which are required to identify the tuple remotely. Without system columns, a postgres_fdw user would have to declare the ctid column on every table for a tuble to be identifiable. The proposal would allow postgres_fdw to automatically inject an hidden (system ?) column on every table for this ctid. -- Ronan Dunklau http://dalibo.com - http://dalibo.org