Thread: Making AFTER triggers act properly in PL functions
Here's another thing that I think it would be good to fix in 8.0. We've had numerous complaints before about RI constraints not firing inside PL functions; a recent one is
http://archives.postgresql.org/pgsql-bugs/2004-09/msg00020.php

I think also that the current behavior violates the SQL spec. SQL99 4.17 says:

    The checking of a constraint depends on its constraint mode within
    the current SQL-transaction. If the constraint mode is immediate,
    then the constraint is effectively checked at the end of each
    SQL-statement.

    NOTE 13 - This includes SQL-statements that are executed as a direct
    result or an indirect result of executing a different SQL-statement.

The NOTE is a bit opaque, but I think it has to be taken as meaning that if a PL function contains DML commands, then the results of those DML commands are checked when they complete, not when the outermost statement completes.

Therefore, I think the behavior we want is that immediate-mode AFTER triggers fire at completion of the statement that queued them, no matter whether this was an interactive statement or one inside a function. Deferred triggers, as now, fire only at end of top-level transaction (or when changed to immediate mode by SET CONSTRAINTS).

Changing this obviously risks breaking existing applications, but we'll have to do it eventually. 8.0 seems like the appropriate time.

Implementation proposal: I was originally thinking we could just invoke DeferredTriggerEndQuery after each query is finished, but that doesn't work. Imagine the situation where a PL function is being called repeatedly by an UPDATE query (which is generating trigger events), and the PL function is doing queries that themselves fire triggers. We want completion of a query within the PL function to fire only those events queued by that query, not events already queued by the outer UPDATE. The outer UPDATE's events should be processed only when it finishes.
To handle this, I think we need to introduce a DeferredTriggerBeginQuery function to match DeferredTriggerEndQuery, and make SPI and other callers of the executor essentially do

    DeferredTriggerBeginQuery()
    ExecutorStart()
    ExecutorRun()
    ExecutorEnd()
    DeferredTriggerEndQuery()

trigger.c will have to be prepared to cope with nested begin/end query calls. I envision it working like this:

* BeginQuery establishes a new, empty list of AFTER triggers for the new query. This is pushed onto a stack of active queries for the current (sub)transaction.

* In the executor, any request to queue an AFTER trigger event queues it onto the topmost query's list. (Error if no active query, so you MUST have done DeferredTriggerBeginQuery.)

* EndQuery processes and discards immediate-mode AFTER trigger events for the current query. Any remaining events (ie, DEFERRED triggers) are appended to the current (sub)transaction's list of pending deferred triggers. Note that even inside a subtransaction, we can discard immediate-mode events.

* EndSubXact, in the commit case, expects the stack of open queries for the current subxact to be empty (error if not); in the abort case, it pops and discards any open queries, along with any deferred triggers of the subxact.

* EndXact and DeferredTriggerSetState continue to act the same as before. In particular, DeferredTriggerSetState need pay no attention to trigger events that are still in lists belonging to open queries; those events aren't ready to fire yet.

Comments?

			regards, tom lane
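For concreteness, here is a minimal C sketch of the per-query event stack described above. All names and fixed-size structures are invented for illustration (the real code lives in trigger.c and uses dynamic lists); this only models the bookkeeping: BeginQuery pushes a fresh event list, events queue onto the topmost list, and EndQuery fires that query's immediate events while handing its deferred events to the (sub)transaction.

```c
#include <assert.h>

#define MAX_DEPTH  16
#define MAX_EVENTS 64

typedef struct {
    int trigger_id;
    int deferred;                 /* 1 = deferred mode, 0 = immediate */
} Event;

typedef struct {
    Event events[MAX_EVENTS];
    int   n;
} EventList;

static EventList query_stack[MAX_DEPTH]; /* one event list per open query */
static int       query_depth = 0;
static EventList xact_deferred;          /* pending deferred events */
static int       fired[MAX_EVENTS];      /* trigger ids fired, in order */
static int       n_fired = 0;

/* push a fresh, empty event list for the query that is starting */
void afterTriggerBeginQuery(void)
{
    assert(query_depth < MAX_DEPTH);
    query_stack[query_depth++].n = 0;
}

/* queue an event onto the topmost query's list; it is an error to queue
 * an event with no query open */
void afterTriggerQueueEvent(int trigger_id, int deferred)
{
    EventList *top;

    assert(query_depth > 0);
    top = &query_stack[query_depth - 1];
    top->events[top->n].trigger_id = trigger_id;
    top->events[top->n].deferred = deferred;
    top->n++;
}

/* fire this query's immediate events; append its deferred events to the
 * (sub)transaction's pending list */
void afterTriggerEndQuery(void)
{
    EventList *top;
    int i;

    assert(query_depth > 0);
    top = &query_stack[--query_depth];
    for (i = 0; i < top->n; i++) {
        if (top->events[i].deferred)
            xact_deferred.events[xact_deferred.n++] = top->events[i];
        else
            fired[n_fired++] = top->events[i].trigger_id;
    }
}
```

With this shape, a query run inside a PL function fires only its own immediate events at its EndQuery; the outer statement's events stay queued until the outer EndQuery.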
On Tue, 7 Sep 2004, Tom Lane wrote:

> * EndQuery processes and discards immediate-mode AFTER trigger events for the
>   current query. Any remaining events (ie, DEFERRED triggers) are appended
>   to the current (sub)transaction's list of pending deferred triggers.
>   Note that even inside a subtransaction, we can discard immediate-mode
>   events.
>
> * EndXact and DeferredTriggerSetState continue to act the same as before.
>   In particular, DeferredTriggerSetState need pay no attention to trigger
>   events that are still in lists belonging to open queries; those events
>   aren't ready to fire yet.
>
> Comments?

If I'm reading the above correctly, I think DeferredTriggerSetState may need to change a little if EndQuery works on a separate list of triggers, because I believe SET CONSTRAINTS IMMEDIATE currently depends on EndQuery going over the entire list of saved deferred triggers.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> If I'm reading the above correctly, I think DeferredTriggerSetState may
> need to change a little if EndQuery works on a separate list of triggers
> because I believe set constraints immediate currently depends on EndQuery
> going over the entire list of saved deferred triggers.

But it would. What I'm imagining is that the current list remains the same, but it only contains trigger events from already-completed statements. The per-query lists would be "staging areas" for gathering events from still-active statements.

The only case where DeferredTriggerSetState would even see any nonempty per-query list is where you had SET CONSTRAINTS being executed inside a PL function that is called from an INSERT/UPDATE/DELETE command that has already generated some trigger events, but is not yet complete. It is not appropriate for those triggers to fire yet, because we haven't completed their generating statement. When we do complete it, we'll fire or defer the triggers according to the then-active SET CONSTRAINTS state.

So AFAICS, DeferredTriggerSetState can and should ignore open per-query trigger lists. It will go over the whole list of events from prior statements, though, the same as it does now.

			regards, tom lane
On Tue, 7 Sep 2004, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > If I'm reading the above correctly, I think DeferredTriggerSetState may
> > need to change a little if EndQuery works on a separate list of triggers
> > because I believe set constraints immediate currently depends on EndQuery
> > going over the entire list of saved deferred triggers.
>
> But it would. What I'm imagining is that the current list remains the
> same, but it only contains trigger events from already-completed statements.
> The per-query lists would be "staging areas" for gathering events from
> still-active statements.

I misread then. I thought that you were proposing that EndQuery look only at the per-query list and then add the deferred items that weren't fired to the main list, but never go over that list.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> I misread then. I thought that you were proposing that EndQuery look only
> at the per-query list and then add the deferred items that weren't fired
> to the main list but never go over that list.

It will have to re-examine the tail of the main list, as well as process the current per-query list. I haven't really done any coding yet, but I think this could be done pretty easily by appending the per-query list to the main list (an easy pointer swing) and then proceeding as before. Or it might be better to do it in two phases --- I'm not sure how hard it is to keep track of whether it's safe to recycle fully-fired events. Knowing that you are processing triggers generated by the current query might be a useful leg up on that task.

Also, it's probably reasonable to assume that SET CONSTRAINTS doesn't queue any new triggers of its own, meaning that in any given EndQuery call only one of these tasks (rescan old triggers or scan new ones) can be needed. That might or might not be worth exploiting.

			regards, tom lane
On Tue, 7 Sep 2004, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > I misread then. I thought that you were proposing that EndQuery look only
> > at the per-query list and then add the deferred items that weren't fired
> > to the main list but never go over that list.
>
> It will have to re-examine the tail of the main list, as well as process
> the current per-query list. I haven't really done any coding yet, but
> I think this could be done pretty easily by appending the per-query list
> to the main list (an easy pointer swing) and then proceeding as before.
> Or it might be better to do it in two phases --- I'm not sure how hard
> it is to keep track of whether it's safe to recycle fully-fired events.
> Knowing that you are processing triggers generated by the current query
> might be a useful leg up on that task.
>
> Also, it's probably reasonable to assume that SET CONSTRAINTS doesn't
> queue any new triggers of its own, meaning that in any given EndQuery
> call only one of these tasks (rescan old triggers or scan new ones)
> can be needed. That might or might not be worth exploiting.

Hmm, if our current state of deferred triggers looks like (in order)

    Trigger A
    Trigger B
    Trigger C

and triggers A and B are made immediate and scanning begins at the beginning of the queue again: during the execution of the Trigger A trigger function, if an update is done to a table with an immediate after trigger (D), does the firing order look like

    Trigger A start
    Trigger D start
    Trigger D end
    Trigger A end
    Trigger B start
    Trigger B end

or something else? What if trigger D calls set constraints to make Trigger C immediate?
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Hmm, if our current state of deferred triggers look like (in order)
>     Trigger A
>     Trigger B
>     Trigger C
> and triggers A and B are made immediate and scanning begins at the
> beginning of the queue again, during the execution of the Trigger A
> trigger function, if an update is done to a table with an immediate after
> trigger (D), does the firing order look like:
>     Trigger A start
>     Trigger D start
>     Trigger D end
>     Trigger A end
>     Trigger B start
>     Trigger B end

Yeah, I would think so.

> What if trigger D calls set constraints to make
> Trigger C immediate?

That would be a query within D, so C would fire within D.

			regards, tom lane
On Tue, 7 Sep 2004, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > Hmm, if our current state of deferred triggers look like (in order)
> >     Trigger A
> >     Trigger B
> >     Trigger C
> > and triggers A and B are made immediate and scanning begins at the
> > beginning of the queue again, during the execution of the Trigger A
> > trigger function, if an update is done to a table with an immediate after
> > trigger (D), does the firing order look like:
> >     Trigger A start
> >     Trigger D start
> >     Trigger D end
> >     Trigger A end
> >     Trigger B start
> >     Trigger B end
>
> Yeah, I would think so.
>
> > What if trigger D calls set constraints to make
> > Trigger C immediate?
>
> That would be a query within D, so C would fire within D.

Right, but if we search the entire trigger queue from the beginning looking for all triggers now immediate and fire them in the EndQuery of the set constraints statement contained in D, we'd potentially get an ordering like

    Trigger A start
    Trigger D start
    Trigger B start
    Trigger B end
    Trigger C start
    Trigger C end
    Trigger D end
    Trigger A end

rather than

    Trigger A start
    Trigger D start
    Trigger C start
    Trigger C end
    Trigger D end
    Trigger A end
    Trigger B start
    Trigger B end

where I'd gather the latter is the intended ordering.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Right, but if we search the entire trigger queue from the beginning
> looking for all triggers now immediate and fire them in the EndQuery of
> the set constraints statement contained in D, we'd potentially get an
> ordering like:
>     Trigger A start
>     Trigger D start
>     Trigger B start
>     Trigger B end
>     Trigger C start
>     Trigger C end
>     Trigger D end
>     Trigger A end
> rather than:
>     Trigger A start
>     Trigger D start
>     Trigger C start
>     Trigger C end
>     Trigger D end
>     Trigger A end
>     Trigger B start
>     Trigger B end
> where I'd gather the latter is the intended ordering.

I think it'd be very debatable which order is "intended". I don't feel a strong need to promise one of these orders over the other.

It does occur to me, though, that there's another hazard here: refiring trigger A, which is already in progress. We'll need to add another flag indicating that to the trigger queue entries ...

			regards, tom lane
On Wed, 8 Sep 2004, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > Right, but if we search the entire trigger queue from the beginning
> > looking for all triggers now immediate and fire them in the EndQuery of
> > the set constraints statement contained in D, we'd potentially get an
> > ordering like:
> >     Trigger A start
> >     Trigger D start
> >     Trigger B start
> >     Trigger B end
> >     Trigger C start
> >     Trigger C end
> >     Trigger D end
> >     Trigger A end
> > rather than:
> >     Trigger A start
> >     Trigger D start
> >     Trigger C start
> >     Trigger C end
> >     Trigger D end
> >     Trigger A end
> >     Trigger B start
> >     Trigger B end
> > where I'd gather the latter is the intended ordering.
>
> I think it'd be very debatable which order is "intended". I don't feel
> a strong need to promise one of these orders over the other.

Okay. The former seems odd to me, especially for exception handling, since Trigger D is making Trigger C immediate, but it could receive exceptions for Trigger B, so it couldn't assume it knows the source of the exception (C or something done due to C's execution) if it did something like:

    BEGIN
      SET CONSTRAINTS C IMMEDIATE;
    EXCEPTION WHEN ... THEN
      ...
    END;

But it may not be a big deal.

> It does occur to me though that there's another hazard here: refiring
> trigger A which is already-in-progress. We'll need to add another flag
> indicating that to the trigger queue entries ...

Yeah, I thought of that after sending, but figured it was easily dealt with.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Okay. The former seems odd to me, especially for exception handling since
> Trigger D is making Trigger C immediate, but it could receive exceptions
> for Trigger B, so it couldn't assume it knows the source of the exception
> (C or something done due to C's execution) if it did something like:
>     BEGIN
>       SET CONSTRAINTS C IMMEDIATE;
>     EXCEPTION WHEN ... THEN
>       ...
>     END;
> But it may not be a big deal.

>> It does occur to me though that there's another hazard here: refiring
>> trigger A which is already-in-progress. We'll need to add another flag
>> indicating that to the trigger queue entries ...

> Yeah, I thought of that after sending, but figured it was easily dealt
> with.

Hmm. Here's a slightly off-the-wall idea: following SET CONSTRAINTS, scan the pending-triggers list twice. The first time, you determine which triggers you need to fire, and mark them "in progress" by your transaction. The second time through, you actually fire the ones you marked, and change their marking to "done". The "in progress" ones wouldn't be touched by the hypothetical inner SET CONSTRAINTS.

It wouldn't quite work to use just transaction ID as the marker, since the inner SET CONSTRAINTS is very possibly done without using a subtransaction. But command ID or query nesting level or some such would work. I think the main concern here would be the space cost of adding still another field to the trigger records ... is it worth it?

			regards, tom lane
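A toy model of the two-pass idea above, with invented names (the real trigger records and SET CONSTRAINTS machinery are more involved): pass one claims pending events by stamping them with the current command's number, and pass two fires only the events that command claimed, so a nested SET CONSTRAINTS issued from inside a trigger body cannot re-enter the batch already being fired.

```c
#include <assert.h>

/* mark: 0 = pending, -1 = done, otherwise the number of the
 * SET CONSTRAINTS command that claimed the event in its first pass */
typedef struct {
    int id;
    int immediate;                /* made immediate by this command? */
    int mark;
} Ev;

#define NEV 3
static Ev  evs[NEV] = {{1, 1, 0}, {2, 1, 0}, {3, 0, 0}};
static int fired_order[NEV];
static int n_fired = 0;
static int cmd_counter = 0;

static void set_constraints_immediate(void);

/* trigger body: event 1 itself issues a nested SET CONSTRAINTS */
static void trigger_body(int id)
{
    fired_order[n_fired++] = id;
    if (id == 1)
        set_constraints_immediate();  /* must not re-fire claimed events */
}

static void set_constraints_immediate(void)
{
    int my_cmd = ++cmd_counter;
    int i;

    /* pass 1: claim every pending event this command will fire */
    for (i = 0; i < NEV; i++)
        if (evs[i].mark == 0 && evs[i].immediate)
            evs[i].mark = my_cmd;

    /* pass 2: fire only the events *this* command claimed */
    for (i = 0; i < NEV; i++)
        if (evs[i].mark == my_cmd) {
            trigger_body(evs[i].id);
            evs[i].mark = -1;     /* done */
        }
}
```

The nested call sees the outer command's events marked in-progress, claims nothing, and returns without firing, which is exactly the skip behavior the two scans buy.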
On Wed, 8 Sep 2004, Tom Lane wrote:

> Hmm. Here's a slightly off the wall idea: following SET CONSTRAINTS,
> scan the pending-triggers list twice. The first time, you determine
> which triggers you need to fire, and mark them "in progress" by your
> transaction. The second time through, you actually fire the ones you
> marked, and change their marking to "done". The "in progress" ones
> wouldn't be touched by the hypothetical inner SET CONSTRAINTS.
>
> It wouldn't quite work to use just transaction ID as the marker, since
> the inner SET CONSTRAINTS is very possibly done without using a
> subtransaction. But command ID or query nesting level or some such
> would work. I think the main concern here would be the space cost of
> adding still another field to the trigger records ... is it worth it?

Would it be possible to basically alias the space for dte_done_xid to hold either the xid if it's done or the <whatever> if it's in progress? That's ugly, but it would presumably not increase the size of the record.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On Wed, 8 Sep 2004, Tom Lane wrote:
>> I think the main concern here would be the space cost of
>> adding still another field to the trigger records ... is it worth it?

> Would it be possible to basically alias the space for dte_done_xid to hold
> either the xid if it's done or the <whatever> if it's in progress? That's
> ugly, but it would presumably not increase the size of the record.

Yeah, I was wondering the same, but hadn't quite worked out the details. The difficulty is that if trigger execution is abandoned due to an error, you'd have to be able to recognize which entries weren't really "in progress" anymore. If the values aren't XIDs, I'm not sure how to do that.

One thing I was looking at was that we aren't using nearly all the bits of dti_state. It'd be possible to narrow that to int16 and then shoehorn a 16-bit query nesting depth counter into the freed space. Not sure if this is enough though.

Actually, I'd really like to get it back down to the 7.4 size, which was already too big :-(. That might be a vain hope though.

			regards, tom lane
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On Wed, 8 Sep 2004, Tom Lane wrote:
>> As long as we're talking about hack-slash-and-burn on this data
>> structure ...

> Where the OtherInformation could be shared within the statement (for
> identical events)? I think it'd be problematic to try sharing between
> statements.

Yeah, I had come to the same conclusion after more thought. But we could certainly aggregate all the similar events generated by a single query into a common status structure.

> But, I'm sort of assuming the following are true:
> Once a group of items is marked to be run, all items will run even if set
> constraints ... deferred happens while the run occurs.

That's a good question. If the first trigger firing tries to set the event deferred, what happens to the remaining triggers? The SQL spec doesn't even touch this question, so I think we are at liberty to do what we like. I don't see that it's unreasonable to continue to fire events that were marked as firable when we reached the end of the current statement.

> If an error occurs, either the entire set of event objects for the
> statement are going away because they're new, or if it was something run
> from set constraints we're going to want to rerun the entire set anyway.

Right, that was what I was thinking.

			regards, tom lane
On Wed, 8 Sep 2004, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > On Wed, 8 Sep 2004, Tom Lane wrote:
> >> As long as we're talking about hack-slash-and-burn on this data
> >> structure ...
>
> > Where the OtherInformation could be shared within the statement (for
> > identical events)? I think it'd be problematic to try sharing between
> > statements.
>
> Yeah, I had come to the same conclusion after more thought. But we
> could certainly aggregate all the similar events generated by a single
> query into a common status structure.

Definitely. The ~20 byte/row gain for large updates/inserts/deletes is worth it. I think it'd actually increase the size for the single-row case, since we'd have the pointer to deal with (we could use a flag that tells us whether this item actually has a pointer to a shared status structure or just contains the status structure itself, but that seems kinda ugly).

> > But, I'm sort of assuming the following are true:
> > Once a group of items is marked to be run, all items will run even if set
> > constraints ... deferred happens while the run occurs.
>
> That's a good question. If the first trigger firing tries to set the
> event deferred, what happens to the remaining triggers? The SQL spec
> doesn't even touch this question, so I think we are at liberty to do
> what we like. I don't see that it's unreasonable to continue to fire
> events that were marked as firable when we reached the end of the
> current statement.

That's what I figured, especially if a function called in an update that does a set constraints doesn't effectively act upon the triggers being queued until the end of the statement.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Definitely. The ~20 byte/row gain for large updates/inserts/deletes is
> worth it. I think it'd actually increase the size for the single-row case
> since we'd have the pointer to deal with (we could use a flag that tells
> us whether this item actually has a pointer to a shared status structure
> or just contains the status structure but that seems kinda ugly).

Yeah. I can't see that anyone will care about another few bytes in single-row cases --- the other per-query overheads will swamp this one. The only cases we've ever heard complaints about are this-query-updated-umpteen-zillion-rows cases, and they were always umpteen zillion cases of the same trigger.

			regards, tom lane
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On Wed, 8 Sep 2004, Tom Lane wrote:
>> It wouldn't quite work to use just transaction ID as the marker, since
>> the inner SET CONSTRAINTS is very possibly done without using a
>> subtransaction. But command ID or query nesting level or some such
>> would work. I think the main concern here would be the space cost of
>> adding still another field to the trigger records ... is it worth it?

> Would it be possible to basically alias the space for dte_done_xid to hold
> either the xid if it's done or the <whatever> if it's in progress? That's
> ugly, but it would presumably not increase the size of the record.

I found a way to do this, which actually is to forget the done_xid field altogether and just store the "firing ID" number. Since firing ID increases monotonically throughout a transaction, all triggers fired during a given subtransaction will have IDs >= the subtransaction-start-time ID counter. So we can clean up by looking for that, which is much faster than the TransactionId testing anyway.

			regards, tom lane
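A sketch of the firing-ID trick under the same caveats (invented names, reduced to the bare counter logic): because the counter only grows within a transaction, the events fired since a subtransaction began are exactly those stamped with a value beyond the counter saved at subtransaction start, so abort cleanup is a single comparison per event.

```c
#include <assert.h>

typedef struct {
    int id;
    int firing_id;                /* 0 = not yet fired */
} Ev;

static int firing_counter = 0;    /* only grows during the transaction */

static void fire(Ev *ev)
{
    ev->firing_id = ++firing_counter;
}

/* on subtransaction abort, "un-fire" exactly the events fired since the
 * counter value saved at subtransaction start */
static void subxact_abort_cleanup(Ev *evs, int n, int saved_counter)
{
    int i;

    for (i = 0; i < n; i++)
        if (evs[i].firing_id > saved_counter)
            evs[i].firing_id = 0;
}
```

No per-event xid is stored at all; the saved counter value plays the role that TransactionId comparisons played before, and the scan is a plain integer compare.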
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On Wed, 8 Sep 2004, Tom Lane wrote:
>> Yeah, I had come to the same conclusion after more thought. But we
>> could certainly aggregate all the similar events generated by a single
>> query into a common status structure.

> Definitely. The ~20 byte/row gain for large updates/inserts/deletes is
> worth it. I think it'd actually increase the size for the single-row case
> since we'd have the pointer to deal with (we could use a flag that tells
> us whether this item actually has a pointer to a shared status structure
> or just contains the status structure but that seems kinda ugly).

I have given up on this idea for the moment, as on further examination it seems to require a lot of work to get any improvement.

The code I just committed uses a 32-byte (on 32-bit machines, anyway) data structure for each trigger event. Allowing for palloc overhead, that's 40 bytes per event. If we separated it into two parts, the per-tuple part would still need 20 bytes per event (a list link pointer, a link to the shared part, and 2 tuple item pointers). Because palloc would round such a requested size up to the next power of 2, there would actually be no savings at all, unless we were willing to hack up palloc to have a special case for this request size. Which is not beyond the realm of reason, certainly, but even with no round-up the effective space requirement would be 28 bytes. Doing a lot of work to get from 40 to 28 bytes doesn't excite me.

I spent some time thinking about whether the per-tuple stuff could be kept in large arrays, so that we eliminate both the list links and the palloc overhead, bringing it down to 16 bytes per event. This would be enough savings to be worth some trouble, but the management seems really messy, mainly because retail deletion of fired events isn't easy anymore. I decided trying to get this done for 8.0 wasn't going to be practical. Possibly someone will take another look in a future release cycle.

			regards, tom lane
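The no-savings claim can be checked with a toy model of palloc's power-of-2 chunk rounding (simplified: real palloc also adds a per-chunk header, which is how the 32-byte struct ends up costing 40 bytes):

```c
#include <assert.h>
#include <stddef.h>

/* round a request up to the next power of 2, a simplified model of how
 * palloc's allocation sets size their chunks */
static size_t pow2_round(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}
```

A 20-byte per-tuple record rounds up to the same 32-byte chunk the unsplit record already uses, which is why the split buys nothing without a palloc special case.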
On Wed, 8 Sep 2004, Tom Lane wrote:

> I wrote:
> > Actually, I'd really like to get it back down to the 7.4 size, which was
> > already too big :-(. That might be a vain hope though.
>
> As long as we're talking about hack-slash-and-burn on this data
> structure ...
>
> The cases where people get annoyed by the size of the deferred trigger
> list are nearly always cases where the exact same trigger is to be fired
> on a large number of tuples from the same relation (ie, we're doing a
> mass INSERT, mass UPDATE, etc). Since it's the exact same trigger, all
> these events must have identical deferrability properties, and will all
> be fired (or not fired) at the same points.
>
> So it seems to me that we could refactor the data structure into some
> per-trigger stuff (tgoid, relid, xid, flag bits) associated with an
> array of per-event records that hold only the old/new ctid fields, and
> get it down to about 12 bytes per tuple instead of forty-some.
>
> However this would lose the current properties concerning event
> firing order. Could we do something where each event stores just
> a pointer to some per-trigger data (shared across all like events)
> plus the old and new ctid fields? 16 bytes is still way better than
> 44.

Something like the main items being:

- next pointer for list
- old ctid
- new ctid
- pointer to other information

with other information:

- event
- relid
- done xid
- n_items
- dte_item array

where the OtherInformation could be shared within the statement (for identical events)? I think it'd be problematic to try sharing between statements.

But, I'm sort of assuming the following are true:

- Once a group of items is marked to be run, all items will run even if set constraints ... deferred happens while the run occurs.
- If set constraints is called inside a function used in a statement (like update foo set bar=f(baz), where f() calls set constraints), the entire queue runs with one particular deferrability.
- If an error occurs, either the entire set of event objects for the statement are going away because they're new, or if it was something run from set constraints we're going to want to rerun the entire set anyway.
I wrote:
> Actually, I'd really like to get it back down to the 7.4 size, which was
> already too big :-(. That might be a vain hope though.

As long as we're talking about hack-slash-and-burn on this data structure ...

The cases where people get annoyed by the size of the deferred trigger list are nearly always cases where the exact same trigger is to be fired on a large number of tuples from the same relation (ie, we're doing a mass INSERT, mass UPDATE, etc). Since it's the exact same trigger, all these events must have identical deferrability properties, and will all be fired (or not fired) at the same points.

So it seems to me that we could refactor the data structure into some per-trigger stuff (tgoid, relid, xid, flag bits) associated with an array of per-event records that hold only the old/new ctid fields, and get it down to about 12 bytes per tuple instead of forty-some.

However this would lose the current properties concerning event firing order. Could we do something where each event stores just a pointer to some per-trigger data (shared across all like events) plus the old and new ctid fields? 16 bytes is still way better than 44.

Thoughts? Am I missing some reason why we could not share status data across multiple tuples, if their events are otherwise identical? If we fail partway through processing the trigger events, I don't see that we care exactly where.

			regards, tom lane
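A sketch of what the shared-status layout discussed above might look like, with invented field and type names rather than the real trigger.c definitions: per-tuple events keep only their list link, the two ctids, and a pointer to a status structure shared by all like events from the query, which preserves per-event firing order while amortizing the trigger/relation/state fields.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
typedef struct {
    unsigned short blk;
    unsigned short off;
} ItemPtr;                        /* stand-in for a tuple's ctid */

/* status shared by all identical events queued by one query */
typedef struct SharedStatus {
    Oid tgoid;                    /* which trigger */
    Oid relid;                    /* which relation */
    int event;                    /* INSERT/UPDATE/DELETE */
    int flags;                    /* deferrability, done state, ... */
} SharedStatus;

/* per-tuple event: just the list link, ctids, and the shared pointer */
typedef struct Event {
    struct Event *next;           /* preserves firing order */
    SharedStatus *shared;
    ItemPtr       old_ctid;
    ItemPtr       new_ctid;
} Event;
```

Two events from the same mass UPDATE point at one SharedStatus, so the trigger/relation/state bookkeeping is paid once per query rather than once per tuple, while the linked list still records exact firing order.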