Thread: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
Hi, I found an issue while executing a backup use case(please see [1] for queries) on postgres version 12. Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as on_shmem_exit call back which will add this function to the before_shmem_exit_list array which is supposed to be removed on pg_stop_backup so that we can do the pending cleanup and issue a warning for each pg_start_backup for which we did not call the pg_stop backup. Now, I executed a query for which JIT is picked up, then the the llvm compiler inserts it's own exit callback i.e. llvm_shutdown at the end of before_shmem_exit_list array. Now, I executed pg_stop_backup and call to cancel_before_shmem_exit() is made with the expectation that the nonexclusive_base_backup_cleanup callback is removed from before_shmem_exit_list array. Since the cancel_before_shmem_exit() only checks the last entry in the before_shmem_exit_list array, which is llvm compiler's exit callback, so we exit the cancel_before_shmem_exit() without removing the intended nonexclusive_base_backup_cleanup callback which remains still the before_shmem_exit_list and gets executed during the session exit throwing a warning "aborting backup due to backend exiting before pg_stop_backup was called", which is unintended. Attached is the patch that fixes the above problem by making cancel_before_shmem_exit() to look for the given function(and for given args) in the entire before_shmem_exit_list array, not just the last entry, starting from the last entry. Request the community take this patch for review for v12. Having said that, abovementioned problem for backup use case does not occur for v13 and latest versions of postgres (please below description[2]), but these kinds of issues can come, if the cancel_before_shmem_exit() is left to just look at the last array entry while removing a registered callback. There's also a comment in cancel_before_shmem_exit() function description "For simplicity, only the latest entry can be removed. (We could work harder but there is no need for current uses.) Since we start to find use cases now, there is a need to enhance cancel_before_shmem_exit(), so I also propose to have the same attached patch for v13 and latest versions. Thoughts? [1] CREATE TABLE t1 (id SERIAL); INSERT INTO t1 (id) SELECT * FROM generate_series(1, 20000000); SELECT * FROM pg_start_backup('label', false, false); /*JIT is enabled in my session and it is being picked by below query*/ EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT COUNT(*) FROM t1; SELECT * FROM pg_stop_backup(false, true); [2] for v13 and latest versions, start_backup first registers do_pg_abort_backup, and then pg_start_backup_callback, performs startup backup operations and unregisters only pg_start_backup_callback from before_shmem_exit_list, retaining do_pg_abort_backup still in the list, which is to be called on session's exit JIT compiler inserts it's own exit call back at the end of before_shmem_exit_list array. stop_backup registers pg_stop_backup_callback, performs stop operations, unregisters pg_stop_backup_callback from before_shmem_exit_list, and sets the sessionBackupState = SESSION_BACKUP_NONE, note that the callback do_pg_abort_backup registered by start_backup command still exists in the before_shmem_exit_list and will not be removed by stop_backup. On session exit, do_pg_abort_backup gets called but returns without performing any operations(not even throwing a warning), by checking sessionBackupState which was set to SESSION_BACKUP_NONE by stop_backup. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as > on_shmem_exit call back which will > add this function to the before_shmem_exit_list array which is > supposed to be removed on pg_stop_backup > so that we can do the pending cleanup and issue a warning for each > pg_start_backup for which we did not call > the pg_stop backup. Now, I executed a query for which JIT is picked > up, then the the llvm compiler inserts it's > own exit callback i.e. llvm_shutdown at the end of > before_shmem_exit_list array. Now, I executed pg_stop_backup > and call to cancel_before_shmem_exit() is made with the expectation > that the nonexclusive_base_backup_cleanup > callback is removed from before_shmem_exit_list array. I'm of the opinion that the JIT code is abusing this mechanism, and the right thing to do is fix that. The restriction you propose to remove is not just there out of laziness, it's an expectation about what safe use of this mechanism would involve. Un-ordered removal of callbacks seems pretty risky: it would mean that whatever cleanup is needed is not going to be done in LIFO order. regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
On Tue, Jul 7, 2020 at 10:24 PM Andres Freund <andres@anarazel.de> wrote: > > On 2020-07-07 09:44:41 -0400, Tom Lane wrote: > > Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > > > Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as > > > on_shmem_exit call back which will > > > add this function to the before_shmem_exit_list array which is > > > supposed to be removed on pg_stop_backup > > > so that we can do the pending cleanup and issue a warning for each > > > pg_start_backup for which we did not call > > > the pg_stop backup. Now, I executed a query for which JIT is picked > > > up, then the the llvm compiler inserts it's > > > own exit callback i.e. llvm_shutdown at the end of > > > before_shmem_exit_list array. Now, I executed pg_stop_backup > > > and call to cancel_before_shmem_exit() is made with the expectation > > > that the nonexclusive_base_backup_cleanup > > > callback is removed from before_shmem_exit_list array. > > > > I'm of the opinion that the JIT code is abusing this mechanism, and the > > right thing to do is fix that. > > What are you proposing? For now we could easily enough work around this > by just making it a on_proc_exit() callback, but that doesn't really > change the fundamental issue imo. > > > The restriction you propose to remove is not just there out of > > laziness, it's an expectation about what safe use of this mechanism > > would involve. Un-ordered removal of callbacks seems pretty risky: it > > would mean that whatever cleanup is needed is not going to be done in > > LIFO order. > I quickly searched(in HEAD) what all the callbacks are getting registered to before_shmem_exit_list, with the intention to see if they also call corresponding cancel_before_shmem_exit() after their intended usage is done. For few of the callbacks there is no cancel_before_shmem_exit(). This seems expected; those callbacks ought to be executed before shmem exit. These callbacks are(let say SET 1): ShutdownPostgres, logicalrep_worker_onexit, llvm_shutdown, Async_UnlistenOnExit, RemoveTempRelationsCallback, ShutdownAuxiliaryProcess, do_pg_abort_backup in xlog.c (this callback exist only in v13 or later), AtProcExit_Twophase. Which means, once they are into the before_shmem_exit_list array, in some order, they are never going to be removed from it as they don't have corresponding cancel_before_shmem_exit() and the relative order of execution remains the same. And there are other callbacks that are getting registered to before_shmem_exit_list array(let say SET 2): apw_detach_shmem, _bt_end_vacuum_callback, pg_start_backup_callback, pg_stop_backup_callback, createdb_failure_callback, movedb_failure_callback, do_pg_abort_backup(in basebackup.c). They all have corresponding cancel_before_shmem_exit() to unregister/remove the callbacks from before_shmem_exit_list array. I think the callbacks that have no cancel_before_shmem_exit()(SET 1) may have to be executed in the LIFO order: it makes sense to execute ShutdownPostgres at the end after let's say other callbacks in SET 1. And the SET 2 callbacks have cancel_before_shmem_exit() with the only intention that there's no need to call the callbacks on the before_shmem_exit(), since they are not needed, and try to remove from the before_shmem_exit_list array and may fail, if any other callback gets registered in between. If I'm not wrong with the above points, we must enhance cancel_before_shmem_exit() or have cancel_before_shmem_exit_v2() (as mentioned in my below response). > > Maybe I am confused, but isn't it <13's pg_start_backup() that's > violating the protocol much more clearly than the JIT code? Given that > it relies on there not being any callbacks registered between two SQL > function calls? I mean, what it does is basically: > > 1) before_shmem_exit(nonexclusive_base_backup_cleanup... > 2) arbitrary code executed for a long time > 3) cancel_before_shmem_exit(nonexclusive_base_backup_cleanup... > > which pretty obviously can't at all deal with any other > before_shmem_exit callbacks being registered in 2). Won't this be a > problem for every other before_shmem_exit callback that we register > on-demand? Say Async_UnlistenOnExit, RemoveTempRelationsCallback? > Yes, for versions <13's, clearly pg_start_backup causes the problem and the issue can also be reproduced with Async_UnlistenOnExit, RemoveTempRelationsCallback coming in between pg_start_backup and pg_stop_backup. We can have it fixed in a few ways: 1) enhance cancel_before_shmem_exit() as attached in the original patch. 2) have existing cancel_before_shmem_exit(), whenever called for nonexclusive_base_backup_cleanup(), we can look for the entire array instead of just the last entry. 3) have a separate function, say, cancel_before_shmem_exit_v2(), that searches for the entire before_shmem_exit_list array(the logic proposed in this patch) so that it will not disturb the existing cancel_before_shmem_exit(). 4) or try to have the pg_start_backup code that exists in after > 13 versions. If okay to have cancel_before_shmem_exit_v2() for versions < 13's, with the searching for the entire array instead of just the last element to fix the abort issue, maybe we can have this function in version 13 and latest as well(at least with disable mode, something like #if 0 ... #endif), so that in case if any of similar issues arise we could just quickly reuse. If the before_shmem_exit_list array is to be used in LIFO order, do we have some comment/usage guideline mentioned in the ipc.c/.h? I didn't find one. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Tue, Jul 7, 2020 at 12:55 PM Andres Freund <andres@anarazel.de> wrote: > What are you proposing? For now we could easily enough work around this > by just making it a on_proc_exit() callback, but that doesn't really > change the fundamental issue imo. I think it would be more correct for it to be an on_proc_exit() callback, because before_shmem_exit() callbacks can and do perform actions which rely on an awful lot of the system being still in a working state. RemoveTempRelationsCallback() is a good example: it thinks it can start and end transactions and make a bunch of catalog changes. I don't know that any of that could use JIT, but moving the JIT cleanup to the on_shmem_exit() stage seems better. At that point, there shouldn't be anybody doing anything that relies on being able to perform logical changes to the database; we're just shutting down low-level subsystems at that point, and thus presumably not doing anything that could possibly need JIT. But I also agree that what pg_start_backup() was doing before v13 was wrong; that's why I committed 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't back-patch it is because the consequences are so minor I didn't think it was worth worrying about. We could, though. I'd be somewhat inclined to both do that and also change LLVM to use on_proc_exit() in master, but I don't feel super-strongly about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
On Tue, Jul 21, 2020 at 1:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jul 7, 2020 at 12:55 PM Andres Freund <andres@anarazel.de> wrote:
> > What are you proposing? For now we could easily enough work around this
> > by just making it a on_proc_exit() callback, but that doesn't really
> > change the fundamental issue imo.
>
> I think it would be more correct for it to be an on_proc_exit()
> callback, because before_shmem_exit() callbacks can and do perform
> actions which rely on an awful lot of the system being still in a
> working state. RemoveTempRelationsCallback() is a good example: it
> thinks it can start and end transactions and make a bunch of catalog
> changes. I don't know that any of that could use JIT, but moving the
> JIT cleanup to the on_shmem_exit() stage seems better. At that point,
> there shouldn't be anybody doing anything that relies on being able to
> perform logical changes to the database; we're just shutting down
> low-level subsystems at that point, and thus presumably not doing
> anything that could possibly need JIT.
>
I looked at what actually llvm_shutdown() does? It frees up JIT stacks, also if exists perf related resource, using LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(), that were dynamically allocated in llvm_session_initialize through a JIT library function LLVMOrcCreateInstance() [1].
It looks like there is no problem in moving llvm_shutdown to either on_shmem_exit() or on_proc_exit().
> But I also agree that what pg_start_backup() was doing before v13 was
> wrong; that's why I committed
> 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
> back-patch it is because the consequences are so minor I didn't think
> it was worth worrying about. We could, though. I'd be somewhat
> inclined to both do that and also change LLVM to use on_proc_exit() in
> master, but I don't feel super-strongly about it.
>
Patch: v1-0001-Move-llvm_shutdown-to-on_proc_exit-list-from-befo.patch
Moved llvm_shutdown to on_proc_exit() call back list. Request to consider this change for master, if possible <=13 versions. Basic JIT use cases and regression tests are working fine with the patch.
Patches: PG11-0001-Fix-minor-problems-with-non-exclusive-backup-clea.patch and PG12-0001-Fix-minor-problems-with-non-exclusive-backup-cleanup.patch
Request to consider the commit 303640199d0436c5e7acdf50b837a027b5726594(above two patches are for this commit) to versions < 13, to fix the abort issue. Please note that the above two patches have no difference in the code, just I made it applicable on PG11.
Patch: v1-0001-Modify-cancel_before_shmem_exit-comments.patch
This patch, modifies cancel_before_shmem_exit() function comment to reflect the safe usage of before_shmem_exit_list callback mechanism and also removes the point "For simplicity, only the latest entry can be removed*********" as this gives a meaning that there is still scope for improvement in cancel_before_shmem_exit() search mechanism.
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > I looked at what actually llvm_shutdown() does? It frees up JIT stacks, also if exists perf related resource, using LLVMOrcDisposeInstance()and LLVMOrcUnregisterPerf(), that were dynamically allocated in llvm_session_initialize through aJIT library function LLVMOrcCreateInstance() [1]. > > It looks like there is no problem in moving llvm_shutdown to either on_shmem_exit() or on_proc_exit(). If it doesn't involve shared memory, I guess it can be on_proc_exit() rather than on_shmem_exit(). I guess the other question is why we're doing it at all. What resources are being allocated that wouldn't be freed up by process exit anyway? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
On Fri, Jul 24, 2020 at 8:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > I looked at what actually llvm_shutdown() does? It frees up JIT stacks, also if exists perf related resource, using LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(), that were dynamically allocated in llvm_session_initialize through a JIT library function LLVMOrcCreateInstance() [1].
> >
> > It looks like there is no problem in moving llvm_shutdown to either on_shmem_exit() or on_proc_exit().
>
> If it doesn't involve shared memory, I guess it can be on_proc_exit()
> rather than on_shmem_exit().
>
> I guess the other question is why we're doing it at all. What
> resources are being allocated that wouldn't be freed up by process
> exit anyway?
>
LLVMOrcCreateInstance() and LLVMOrcDisposeInstance() are doing new and delete respectively, I just found these functions from the link [1]. But I don't exactly know whether there are any other resources being allocated that can't be freed up by proc_exit(). Tagging @Andres Freund for inputs on whether we have any problem making llvm_shutdown() a on_proc_exit() callback instead of before_shmem_exit() callback.
And as suggested in the previous mails, we wanted to make it on_proc_exit() to avoid the abort issue reported in this mail chain, however if we take the abort issue fix commit # 303640199d0436c5e7acdf50b837a027b5726594 as mentioned in the previous response[2], then it may not be necessary, right now, but just to be safer and to avoid any of these similar kind of issues in future, we can consider this change as well.
LLVMOrcJITStackRef LLVMOrcCreateInstance(LLVMTargetMachineRef TM) {
TargetMachine *TM2(unwrap(TM));
Triple T(TM2->getTargetTriple());
auto IndirectStubsMgrBuilder =
orc::createLocalIndirectStubsManagerBuilder(T);
OrcCBindingsStack *JITStack =
new OrcCBindingsStack(*TM2, std::move(IndirectStubsMgrBuilder));
return wrap(JITStack);
}
LLVMErrorRef LLVMOrcDisposeInstance(LLVMOrcJITStackRef JITStack) {
auto *J = unwrap(JITStack);
auto Err = J->shutdown();
delete J;
return wrap(std::move(Err));
}
[2] - https://www.postgresql.org/message-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg%40mail.gmail.com
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Mon, Jul 20, 2020 at 3:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > But I also agree that what pg_start_backup() was doing before v13 was > wrong; that's why I committed > 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't > back-patch it is because the consequences are so minor I didn't think > it was worth worrying about. We could, though. I'd be somewhat > inclined to both do that and also change LLVM to use on_proc_exit() in > master, but I don't feel super-strongly about it. Unless somebody complains pretty soon, I'm going to go ahead and do what is described above. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Thu, Jul 30, 2020 at 8:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > Unless somebody complains pretty soon, I'm going to go ahead and do > what is described above. Done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
On Thu, Aug 6, 2020 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 30, 2020 at 8:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Unless somebody complains pretty soon, I'm going to go ahead and do > > what is described above. > > Done. > Thanks! I have one more request to make: since we are of the opinion to not change the way cancel_before_shmem_exit() searches before_shmem_exit_list array, wouldn't it be good to adjust comments before the function cancel_before_shmem_exit()? I sent the patch previously[1], but attaching here again, modifies cancel_before_shmem_exit() function comment to reflect the safe usage of before_shmem_exit_list callback mechanism and also removes the point "For simplicity, only the latest entry can be removed*********" as this gives a meaning that there is still scope for improvement in cancel_before_shmem_exit() search mechanism. Thoughts? [1] - https://www.postgresql.org/message-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Thu, Aug 6, 2020 at 11:46 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > I sent the patch previously[1], but attaching here again, modifies > cancel_before_shmem_exit() function comment to reflect the safe usage > of before_shmem_exit_list callback mechanism and also removes the > point "For simplicity, only the latest entry can be removed*********" > as this gives a meaning that there is still scope for improvement in > cancel_before_shmem_exit() search mechanism. > > Thoughts? I think that the first part of the comment change you suggest is a good idea and would avoid developer confusion, but I think that the statement about unordered removal of comments being risky doesn't add much. It's too vague to help anybody and I don't think I believe it, either. So I suggest something more like: - * callback. For simplicity, only the latest entry can be - * removed. (We could work harder but there is no need for - * current uses.) + * callback. We only look at the latest entry for removal, as we + * expect the caller to use before_shmem_exit callback mechanism + * in the LIFO order. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > ... So I suggest something more like: > - * callback. For simplicity, only the latest entry can be > - * removed. (We could work harder but there is no need for > - * current uses.) > + * callback. We only look at the latest entry for removal, as we > + * expect the caller to use before_shmem_exit callback mechanism > + * in the LIFO order. That's a meaningless statement for any one caller. So it needs to be more like "we expect callers to add and remove temporary before_shmem_exit callbacks in strict LIFO order". I wonder whether we ought to change the function to complain if the last list entry doesn't match. We'd have caught this bug sooner if it did, and it's not very clear why silently doing nothing is a good idea when there's no match. regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Fri, Aug 7, 2020 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > That's a meaningless statement for any one caller. So it needs to be more > like "we expect callers to add and remove temporary before_shmem_exit > callbacks in strict LIFO order". Sure, that seems fine. > I wonder whether we ought to change the function to complain if the > last list entry doesn't match. We'd have caught this bug sooner > if it did, and it's not very clear why silently doing nothing is > a good idea when there's no match. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Andres Freund
Date:
Hi, On 2020-08-07 12:29:03 -0400, Robert Haas wrote: > On Thu, Aug 6, 2020 at 11:46 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > I sent the patch previously[1], but attaching here again, modifies > > cancel_before_shmem_exit() function comment to reflect the safe usage > > of before_shmem_exit_list callback mechanism and also removes the > > point "For simplicity, only the latest entry can be removed*********" > > as this gives a meaning that there is still scope for improvement in > > cancel_before_shmem_exit() search mechanism. > > > > Thoughts? > > I think that the first part of the comment change you suggest is a > good idea and would avoid developer confusion, but I think that the > statement about unordered removal of comments being risky doesn't add > much. It's too vague to help anybody and I don't think I believe it, > either. So I suggest something more like: > > - * callback. For simplicity, only the latest entry can be > - * removed. (We could work harder but there is no need for > - * current uses.) > + * callback. We only look at the latest entry for removal, as we > + * expect the caller to use before_shmem_exit callback mechanism > + * in the LIFO order. In which situations is the removal actually useful *and* safe, with these constraints? You'd have to have a very narrow set of functions that are called while the exit hook is present, i.e. basically this would only be usable for PG_ENSURE_ERROR_CLEANUP and nothing else. And even there it seems like it's pretty easy to get into a situation where it's not safe. Greetings, Andres Freund
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
On Fri, Aug 7, 2020 at 11:09 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Aug 7, 2020 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > That's a meaningless statement for any one caller. So it needs to be more > > like "we expect callers to add and remove temporary before_shmem_exit > > callbacks in strict LIFO order". > > Sure, that seems fine. > v2 patch has the comments modified. > > > I wonder whether we ought to change the function to complain if the > > last list entry doesn't match. We'd have caught this bug sooner > > if it did, and it's not very clear why silently doing nothing is > > a good idea when there's no match. > > +1. > This is a good idea. v3 patch has both the modified comments(from v2) as well as a DEBUG3 (DEBUG3 level, because the other non-error/non-fatal logs in ipc.c are using the same level) log to report when the latest entry for removal is not matched with the one the caller cancel_before_shmem_exit() is looking for and a hint on how to safely use temporary before_shmem_exit() callbacks. In v3 patch, function pointer is being printed, I'm not sure how much it is helpful to have function pointers in the logs though there are some other places printing pointers into the logs, I wish I could print function names. (Is there a way we could get function names from function pointers?). Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Fri, Aug 7, 2020 at 5:20 PM Andres Freund <andres@anarazel.de> wrote: > In which situations is the removal actually useful *and* safe, with > these constraints? You'd have to have a very narrow set of functions > that are called while the exit hook is present, i.e. basically this > would only be usable for PG_ENSURE_ERROR_CLEANUP and nothing else. And > even there it seems like it's pretty easy to get into a situation where > it's not safe. Well, I don't really care whether or not we change this function to iterate over the callback list or whether we add a warning that you need to use it in LIFO order, but I think we should do one or the other, because this same confusion has come up multiple times. I thought that Tom was opposed to making it iterate over the callback list (for reasons I don't really understand, honestly) so adding a comment and a cross-check seemed like the practical option. Now I also think it's fine to iterate over the callback list: this function doesn't get used so much that it's likely to be a performance problem, and I don't think this is the first bug that would have become a non-bug had we done that years and years ago whenever it was first proposed. In fact, I'd go so far as to say that the latter is a slightly better option. However, doing nothing is clearly worst. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Well, I don't really care whether or not we change this function to > iterate over the callback list or whether we add a warning that you > need to use it in LIFO order, but I think we should do one or the > other, because this same confusion has come up multiple times. I > thought that Tom was opposed to making it iterate over the callback > list (for reasons I don't really understand, honestly) so adding a > comment and a cross-check seemed like the practical option. Now I also > think it's fine to iterate over the callback list: this function > doesn't get used so much that it's likely to be a performance problem, > and I don't think this is the first bug that would have become a > non-bug had we done that years and years ago whenever it was first > proposed. In fact, I'd go so far as to say that the latter is a > slightly better option. However, doing nothing is clearly worst. I agree that doing nothing seems like a bad idea. My concern about allowing non-LIFO callback removal is that it seems to me that such usage patterns have likely got *other* bugs, so we should discourage that. These callbacks don't exist in a vacuum: they reflect that the mainline code path has set up, or torn down, important state. Non-LIFO usage requires very strong assumptions that the states in question are not interdependent, and that's something I'd rather not rely on if we don't have to. regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Mon, Aug 10, 2020 at 10:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I agree that doing nothing seems like a bad idea. My concern about > allowing non-LIFO callback removal is that it seems to me that such > usage patterns have likely got *other* bugs, so we should discourage > that. These callbacks don't exist in a vacuum: they reflect that > the mainline code path has set up, or torn down, important state. > Non-LIFO usage requires very strong assumptions that the states in > question are not interdependent, and that's something I'd rather not > rely on if we don't have to. I have mixed feelings about this. On the one hand, I think saying that it requires "very strong assumptions" about interdependence makes it sound scarier than it is -- not that your statement is wrong, but that in most cases we kinda know whether that's true or not, and the idea that you shouldn't remove a callback upon which some later callback might be depending is not too difficult for anybody to understand. On the other hand, I think that in most cases we ought to be discouraging people who are trying to do per-subsystem cleanup from using before_shmem_exit() at all. I think such callbacks ought to be done using on_shmem_exit() if they involve shared memory or on_proc_exit() if they do not. Those functions don't have cancel_blah_exit() variants, and I don't think they should: the right way to code those things is not to remove the callbacks from the stack when they're no longer needed, but rather to code the callbacks so that they will do nothing if no work is required, leaving them permanently registered. And the main reason why I think that such callbacks should be registered using on_shmem_exit() or on_proc_exit() rather than before_shmem_exit() is because of (1) the interdependency issue you raise and (2) the fact that cancel_before_shmem_exit doesn't do what people tend to think it does. For an example of (1), look at ShutdownPostgres() and RemoveTempRelationsCallback(). The latter aborts out of any transaction and then starts a new one to drop your temp schema. But that might fail, so ShutdownPostgres() also needs to be prepared to AbortOutOfAnyTransaction(). If you inserted more cleanup steps that were thematically similar to those, each one of them would also need to begin with AbortOutOfAnyTransaction(). That suggests that this whole thing is a bit under-engineered. Some of the other before_shmem_exit() callbacks don't start with that incantation, but that's because they are per-subsystem callbacks that likely ought to be using on_shmem_exit() rather than actually being the same sort of thing. Perhaps we really have four categories here: (1) Temporary handlers for PG_ENSURE_ERROR_CLEANUP(). (2) High-level cleanup that needs to run after aborting out of the current transaction. (3) Per-subsystem shutdown for shared memory stuff. (4) Per-subsystem shutdown for backend-private stuff. Right now we blend (1), (2), and some of (3) together, but we could try to create a cleaner line. We could redefine before_shmem_exit() as being exactly #2, and abort out of any transaction before calling each step, and document that you shouldn't use it unless you need that behavior. And we could have a separate stack for #1 that is explicitly LIFO and not intended for any other use. But then again maybe that's overkill. What I do think we should do, after thinking about it more, is discourage the casual use of before_shmem_exit() for things where on_shmem_exit() or on_proc_exit() would be just as good. I think that's what would avoid the most problems here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Perhaps we really have four categories here: > (1) Temporary handlers for PG_ENSURE_ERROR_CLEANUP(). > (2) High-level cleanup that needs to run after aborting out of the > current transaction. > (3) Per-subsystem shutdown for shared memory stuff. > (4) Per-subsystem shutdown for backend-private stuff. Hmm, I don't think we actually have any of (2) do we? Or at least we aren't using ipc.c callbacks for them. > What I do think we should do, after thinking about it more, > is discourage the casual use of before_shmem_exit() for things where > on_shmem_exit() or on_proc_exit() would be just as good. I think > that's what would avoid the most problems here. I think we're mostly in violent agreement here. The interesting question seems to be Andres' one about whether before_shmem_exit actually has any safe use except for PG_ENSURE_ERROR_CLEANUP. It may not, in which case perhaps we oughta rename it? regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Mon, Aug 10, 2020 at 3:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Perhaps we really have four categories here: > > (1) Temporary handlers for PG_ENSURE_ERROR_CLEANUP(). > > (2) High-level cleanup that needs to run after aborting out of the > > current transaction. > > (3) Per-subsystem shutdown for shared memory stuff. > > (4) Per-subsystem shutdown for backend-private stuff. > > Hmm, I don't think we actually have any of (2) do we? Or at least > we aren't using ipc.c callbacks for them. Well, I was thinking about the place where ShutdownPostgres() does LockReleaseAll(), and also the stuff in RemoveTempRelationsCallback(). Those are pretty high-level operations that need to happen before we start shutting down subsystems. Especially the removal of temp relations. > > What I do think we should do, after thinking about it more, > > is discourage the casual use of before_shmem_exit() for things where > > on_shmem_exit() or on_proc_exit() would be just as good. I think > > that's what would avoid the most problems here. > > I think we're mostly in violent agreement here. The interesting > question seems to be Andres' one about whether before_shmem_exit > actually has any safe use except for PG_ENSURE_ERROR_CLEANUP. > It may not, in which case perhaps we oughta rename it? If we could eliminate the other places where it's used, that'd be great. That's not too clear to me, though, because of the above two cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Andres Freund
Date:
Hi, On 2020-08-10 15:50:19 -0400, Robert Haas wrote: > On Mon, Aug 10, 2020 at 3:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > What I do think we should do, after thinking about it more, > > > is discourage the casual use of before_shmem_exit() for things where > > > on_shmem_exit() or on_proc_exit() would be just as good. I think > > > that's what would avoid the most problems here. > > > > I think we're mostly in violent agreement here. The interesting > > question seems to be Andres' one about whether before_shmem_exit > > actually has any safe use except for PG_ENSURE_ERROR_CLEANUP. > > It may not, in which case perhaps we oughta rename it? > If we could eliminate the other places where it's used, that'd be > great. That's not too clear to me, though, because of the above two > cases. I think there's two different aspects here: Having before_shmem_exit(), and having cancel_before_shmem_exit(). We could just not have the latter, and instead use a separate list for PG_ENSURE_ERROR_CLEANUP internally. With the callback for PG_ENSURE_ERROR_CLEANUP calling those from its private list. There's no other uses of cancel_before_shmem_exit afaict. I guess alternatively we at some point might just need a more complex callback system, where one can specify where in relation to another callback a callback needs to be registered etc. Greetings, Andres Freund
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > I think there's two different aspects here: Having before_shmem_exit(), > and having cancel_before_shmem_exit(). We could just not have the > latter, and instead use a separate list for PG_ENSURE_ERROR_CLEANUP > internally. With the callback for PG_ENSURE_ERROR_CLEANUP calling those > from its private list. There's no other uses of > cancel_before_shmem_exit afaict. It's certainly arguable that PG_ENSURE_ERROR_CLEANUP is a special snowflake and needs to use a separate mechanism. What is not real clear to me is why there are any other callers that must use before_shmem_exit rather than on_shmem_exit --- IOW, except for P_E_E_C's use, I have never been persuaded that the former callback list should exist at all. The expectation for on_shmem_exit is that callbacks correspond to system service modules that are initialized in a particular order, and can safely be torn down in the reverse order. Why can't the existing callers just make even-later entries into that same callback list? regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Robert Haas
Date:
On Mon, Aug 10, 2020 at 8:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's certainly arguable that PG_ENSURE_ERROR_CLEANUP is a special > snowflake and needs to use a separate mechanism. What is not real clear > to me is why there are any other callers that must use before_shmem_exit > rather than on_shmem_exit --- IOW, except for P_E_E_C's use, I have never > been persuaded that the former callback list should exist at all. The > expectation for on_shmem_exit is that callbacks correspond to system > service modules that are initialized in a particular order, and can safely > be torn down in the reverse order. Why can't the existing callers just > make even-later entries into that same callback list? That split dates to the parallel query work, and there are some comments in shmem_exit() about it; see in particular the explanation in the middle where it says "Call dynamic shared memory callbacks." It seemed to me that I needed the re-entrancy behavior that is described there, but for a set of callbacks that needed to run before some of the existing callbacks and after others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Michael Paquier
Date:
On Mon, Aug 10, 2020 at 10:10:08PM -0400, Robert Haas wrote: > That split dates to the parallel query work, and there are some > comments in shmem_exit() about it; see in particular the explanation > in the middle where it says "Call dynamic shared memory callbacks." It > seemed to me that I needed the re-entrancy behavior that is described > there, but for a set of callbacks that needed to run before some of > the existing callbacks and after others. We still have a CF entry here: https://commitfest.postgresql.org/29/2649/ Is there still something that needs to absolutely be done here knowing that we have bab1500 that got rid of the root issue? Can the CF entry be marked as committed? (FWIW, I would move any discussion about improving more stuff related to shared memory cleanup code at proc exit into a new thread, as that looks like a new topic.) -- Michael
Attachment
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes: > Is there still something that needs to absolutely be done here knowing > that we have bab1500 that got rid of the root issue? Can the CF entry > be marked as committed? I think there is agreement that we're not going to change cancel_before_shmem_exit's restriction to only allow LIFO popping. So we should improve its comment to explain why. The other thing that seems legitimately on-the-table for this CF entry is whether we should change cancel_before_shmem_exit to complain, rather than silently do nothing, if it fails to pop the stack. Bharath's last patchset proposed to add an elog(DEBUG3) complaint, which seems to me to be just about entirely useless. I'd make it an ERROR, or maybe an Assert. regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Bharath Rupireddy
Date:
On Mon, Sep 7, 2020 at 8:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I think there is agreement that we're not going to change > cancel_before_shmem_exit's restriction to only allow LIFO popping. > So we should improve its comment to explain why. The other thing > that seems legitimately on-the-table for this CF entry is whether > we should change cancel_before_shmem_exit to complain, rather than > silently do nothing, if it fails to pop the stack. Bharath's > last patchset proposed to add an elog(DEBUG3) complaint, which > seems to me to be just about entirely useless. I'd make it an > ERROR, or maybe an Assert. > Attaching a patch with both the comments modification and changing DEBUG3 to ERROR. make check and make world-check passes on this patch. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Tom Lane
Date:
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > Attaching a patch with both the comments modification and changing > DEBUG3 to ERROR. make check and make world-check passes on this patch. I pushed this after simplifying the ereport down to an elog. I see no reason to consider this a user-facing error, so there's no need to make translators deal with the message. regards, tom lane
Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks
From
Daniel Gustafsson
Date:
> On 30 Jul 2020, at 14:11, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 20, 2020 at 3:47 PM Robert Haas <robertmhaas@gmail.com> wrote: >> But I also agree that what pg_start_backup() was doing before v13 was >> wrong; that's why I committed >> 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't >> back-patch it is because the consequences are so minor I didn't think >> it was worth worrying about. We could, though. I'd be somewhat >> inclined to both do that and also change LLVM to use on_proc_exit() in >> master, but I don't feel super-strongly about it. > > Unless somebody complains pretty soon, I'm going to go ahead and do > what is described above. When backpatching 9dce22033d5d I ran into this in v13 and below, since it needs llvm_shutdown to happen via on_proc_exit in order for all llvm_release_context calls to have finished. Unless anyone objects I will backpatch bab150045bd97 to v12 and v13 as part of my backpatch. -- Daniel Gustafsson