Thread: non-exclusive backup cleanup is mildly broken
While reviewing the first patch in Asif Rehman's series of patches for parallel backup over at http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgpLrq4cBQE5ZviUxygKq3VqiA@mail.gmail.com I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694 introduced a use of cancel_before_shmem_exit which falsifies the comment for that function. So you can cause a spurious WARNING in the logs by doing something like this, with max_prepared_transactions set to a non-zero value: select pg_start_backup('one', false, false); begin; prepare transaction 'nothing'; select pg_stop_backup(false); \q in the server log: WARNING: aborting backup due to backend exiting before pg_stop_backup was called And you can cause an assertion failure like this: select pg_start_backup('one', false, false); begin; prepare transaction 'nothing'; select pg_stop_backup(false); select pg_start_backup('two'); \q We've discussed before the idea that it might be good to remove the limitation that before_shmem_exit() can only remove the most-recently-added callback, which would be one way to fix this problem, but I'd like to propose an alternative solution which I think will work out more nicely for the patch mentioned above: don't use cancel_before_shmem_exit, and just leave the callback registered for the lifetime of the backend. That requires some adjustment of the callback, since it needs to tolerate exclusive backup mode being in effect. The attached patch takes that approach. Thoughts welcome on (1) the approach and (2) whether to back-patch. I think there's little doubt that this is formally a bug, but it's a pretty minor one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hello. At Wed, 11 Dec 2019 17:32:05 -0500, Robert Haas <robertmhaas@gmail.com> wrote in > While reviewing the first patch in Asif Rehman's series of patches for > parallel backup over at > http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgpLrq4cBQE5ZviUxygKq3VqiA@mail.gmail.com > I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694 > introduced a use of cancel_before_shmem_exit which falsifies the > comment for that function. So you can cause a spurious WARNING in the > logs by doing something like this, with max_prepared_transactions set > to a non-zero value: > > select pg_start_backup('one', false, false); > begin; > prepare transaction 'nothing'; > select pg_stop_backup(false); > \q > > in the server log: > WARNING: aborting backup due to backend exiting before pg_stop_backup > was called > > And you can cause an assertion failure like this: > > select pg_start_backup('one', false, false); > begin; > prepare transaction 'nothing'; > select pg_stop_backup(false); > select pg_start_backup('two'); > \q > > We've discussed before the idea that it might be good to remove the > limitation that before_shmem_exit() can only remove the > most-recently-added callback, which would be one way to fix this > problem, but I'd like to propose an alternative solution which I think > will work out more nicely for the patch mentioned above: don't use > cancel_before_shmem_exit, and just leave the callback registered for > the lifetime of the backend. That requires some adjustment of the > callback, since it needs to tolerate exclusive backup mode being in > effect. > > The attached patch takes that approach. Thoughts welcome on (1) the > approach and (2) whether to back-patch. I think there's little doubt > that this is formally a bug, but it's a pretty minor one. The direction seems reasonable, but the patch doesn't free up the before_shmem_exec slot nor avoid duplicate registration of the callback. Actually before_shmem_exit_list gets bloat with multiple do_pg_abort_backup entries through repeated rounds of non-exclusive backups. As the result, if one ends a session while a non-exclusive backup is active after closing the previous non-exclusive backup, do_pg_abort_backup aborts for assertion failure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Hello.
At Wed, 11 Dec 2019 17:32:05 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
> While reviewing the first patch in Asif Rehman's series of patches for
> parallel backup over at
> http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgpLrq4cBQE5ZviUxygKq3VqiA@mail.gmail.com
> I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694
> introduced a use of cancel_before_shmem_exit which falsifies the
> comment for that function. So you can cause a spurious WARNING in the
> logs by doing something like this, with max_prepared_transactions set
> to a non-zero value:
>
> select pg_start_backup('one', false, false);
> begin;
> prepare transaction 'nothing';
> select pg_stop_backup(false);
> \q
>
> in the server log:
> WARNING: aborting backup due to backend exiting before pg_stop_backup
> was called
>
> And you can cause an assertion failure like this:
>
> select pg_start_backup('one', false, false);
> begin;
> prepare transaction 'nothing';
> select pg_stop_backup(false);
> select pg_start_backup('two');
> \q
>
> We've discussed before the idea that it might be good to remove the
> limitation that before_shmem_exit() can only remove the
> most-recently-added callback, which would be one way to fix this
> problem, but I'd like to propose an alternative solution which I think
> will work out more nicely for the patch mentioned above: don't use
> cancel_before_shmem_exit, and just leave the callback registered for
> the lifetime of the backend. That requires some adjustment of the
> callback, since it needs to tolerate exclusive backup mode being in
> effect.
>
> The attached patch takes that approach. Thoughts welcome on (1) the
> approach and (2) whether to back-patch. I think there's little doubt
> that this is formally a bug, but it's a pretty minor one.
The direction seems reasonable, but the patch doesn't free up the
before_shmem_exec slot nor avoid duplicate registration of the
callback. Actually before_shmem_exit_list gets bloat with multiple
do_pg_abort_backup entries through repeated rounds of non-exclusive
backups.
As the result, if one ends a session while a non-exclusive backup is
active after closing the previous non-exclusive backup,
do_pg_abort_backup aborts for assertion failure.
+1, I also think the direction seems perfectly reasonable, but we should avoid re-adding the callback since we're not removing it. Leaving it around seems cheap enough as long as there is only one.
My first reaction would be to just disallow the combination of prepared transactions and start/stop backups. But lookingat it it seems like an unnecessray restriction and an approach like this one seems better.
On Thu, Dec 12, 2019 at 01:52:31PM +0100, Magnus Hagander wrote: > On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> > wrote: >> The direction seems reasonable, but the patch doesn't free up the >> before_shmem_exec slot nor avoid duplicate registration of the >> callback. Actually before_shmem_exit_list gets bloat with multiple >> do_pg_abort_backup entries through repeated rounds of non-exclusive >> backups. >> >> As the result, if one ends a session while a non-exclusive backup is >> active after closing the previous non-exclusive backup, >> do_pg_abort_backup aborts for assertion failure. Agreed, that's an issue and do_pg_abort_abort should not touch sessionBackupState, so you should keep cancel_before_shmem_exit after pg_stop_backup(). Other than that, I have looked in details at how safe it is to move before_shmem_exit(do_pg_abort_backup) before do_pg_start_backup() and the cleanups of nonExclusiveBackups happen safely and consistently in the event of an error during pg_start_backup(). > +1, I also think the direction seems perfectly reasonable, but we should > avoid re-adding the callback since we're not removing it. Leaving it around > seems cheap enough as long as there is only one. + (errmsg("aborting backup due to backend exiting before pg_stop_back up was called"))); Not sure that pg_stop_back exists ;p > My first reaction would be to just disallow the combination of prepared > transactions and start/stop backups. But looking at it it seems like an > unnecessary restriction and an approach like this one seems better. I think that's a bad idea to put a restriction of this kind. There are large consumers of 2PC, and everybody needs backups. -- Michael
Attachment
On Fri, Dec 13, 2019 at 9:00 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 12, 2019 at 01:52:31PM +0100, Magnus Hagander wrote:
> On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> My first reaction would be to just disallow the combination of prepared
> transactions and start/stop backups. But looking at it it seems like an
> unnecessary restriction and an approach like this one seems better.
I think that's a bad idea to put a restriction of this kind. There
are large consumers of 2PC, and everybody needs backups.
You misunderstood me. I certainly didn't mean that people who use 2PC shouldn't be able to use proper backups -- that would be *terrible*.
I meant disallowing pg_start_backup() in a session that had a prepared transaction, and disallowing preparing a transaction in a session with an ongoing backup. They would still work perfectly fine in *other* parallel sessions.
That said, being able to do it in the session itself is of course even better.
On 12/13/19 3:56 AM, Magnus Hagander wrote: > On Fri, Dec 13, 2019 at 9:00 AM Michael Paquier <michael@paquier.xyz > > I think that's a bad idea to put a restriction of this kind. There > are large consumers of 2PC, and everybody needs backups. > > > You misunderstood me. I certainly didn't mean that people who use 2PC > shouldn't be able to use proper backups -- that would be *terrible*. > > I meant disallowing pg_start_backup() in a session that had a prepared > transaction, and disallowing preparing a transaction in a session with > an ongoing backup. They would still work perfectly fine in *other* > parallel sessions. +1. I think it is reasonable to expect pg_start/stop_backup() to be performed in its own session without prepared transactions. +more if this concession keeps other aspects of the code simpler. -- -David david@pgmasters.net
At Fri, 13 Dec 2019 18:50:25 -0500, David Steele <david@pgmasters.net> wrote in > On 12/13/19 3:56 AM, Magnus Hagander wrote: > > On Fri, Dec 13, 2019 at 9:00 AM Michael Paquier <michael@paquier.xyz I > > think that's a bad idea to put a restriction of this kind. There > > are large consumers of 2PC, and everybody needs backups. > > You misunderstood me. I certainly didn't mean that people who use 2PC > > shouldn't be able to use proper backups -- that would be *terrible*. > > I meant disallowing pg_start_backup() in a session that had a prepared > > transaction, and disallowing preparing a transaction in a session with > > an ongoing backup. They would still work perfectly fine in *other* > > parallel sessions. > > +1. I think it is reasonable to expect pg_start/stop_backup() to be > performed in its own session without prepared transactions. > > +more if this concession keeps other aspects of the code simpler. However I don't object to the restriction, couldn't we allow the cancel_before_shmem_exit to search for the given entry looping over the before_shmem_exit array? If we don't do that, an assrtion is needed instead. Since pg_stop_backup_v2 is the only caller to the function in the whole server code, making cancel_before_shmem_exit a bit wiser (and slower) cannot hurt anyone. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Sun, Dec 15, 2019 at 8:44 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > However I don't object to the restriction, couldn't we allow the > cancel_before_shmem_exit to search for the given entry looping over > the before_shmem_exit array? If we don't do that, an assrtion is needed > instead. > > Since pg_stop_backup_v2 is the only caller to the function in the > whole server code, making cancel_before_shmem_exit a bit wiser (and > slower) cannot hurt anyone. That's actually not true. It's called from PG_END_ENSURE_ERROR_CLEANUP. Still, it wouldn't cost a lot to fix this that way. However, I think that it's better to fix it the other way, as I mentioned in my original email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 17, 2019 at 4:19 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sun, Dec 15, 2019 at 8:44 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > However I don't object to the restriction, couldn't we allow the > > cancel_before_shmem_exit to search for the given entry looping over > > the before_shmem_exit array? If we don't do that, an assrtion is needed > > instead. > > > > Since pg_stop_backup_v2 is the only caller to the function in the > > whole server code, making cancel_before_shmem_exit a bit wiser (and > > slower) cannot hurt anyone. > > That's actually not true. It's called from > PG_END_ENSURE_ERROR_CLEANUP. Still, it wouldn't cost a lot to fix this > that way. However, I think that it's better to fix it the other way, > as I mentioned in my original email. +1 Not only PREPARE but also other commands that we may add in the future can cause the same issue, so it's better to address the root cause rather than working around by disallowing PREPARE. Regards, -- Fujii Masao
At Tue, 17 Dec 2019 11:46:03 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in > On Tue, Dec 17, 2019 at 4:19 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Sun, Dec 15, 2019 at 8:44 PM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > However I don't object to the restriction, couldn't we allow the > > > cancel_before_shmem_exit to search for the given entry looping over > > > the before_shmem_exit array? If we don't do that, an assrtion is needed > > > instead. > > > > > > Since pg_stop_backup_v2 is the only caller to the function in the > > > whole server code, making cancel_before_shmem_exit a bit wiser (and > > > slower) cannot hurt anyone. > > > > That's actually not true. It's called from > > PG_END_ENSURE_ERROR_CLEANUP. Still, it wouldn't cost a lot to fix this > > that way. However, I think that it's better to fix it the other way, > > as I mentioned in my original email. Sorry. I knew that. > +1 > > Not only PREPARE but also other commands that we may add in the future > can cause the same issue, so it's better to address the root cause rather > than working around by disallowing PREPARE. I stand on that side. I'm not sure what we are thinking as the root cause, but PREPARE is avoiding duplicate registration using the static bool twophaseExitRegistered and the most reasonable way to fix the crash of the current patch would be to do the same thing as PREPARE. The attached does that and changes the if condition of cancel_before_shmem_exit into assertion. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index cac5854fcf..6f13084c43 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -80,6 +80,7 @@ pg_start_backup(PG_FUNCTION_ARGS) } else { + static bool exit_handler_registered = false; MemoryContext oldcontext; /* @@ -91,7 +92,12 @@ pg_start_backup(PG_FUNCTION_ARGS) tblspc_map_file = makeStringInfo(); MemoryContextSwitchTo(oldcontext); - before_shmem_exit(do_pg_abort_backup, BoolGetDatum(true)); + /* We don't remove this handler so register it only once */ + if (!exit_handler_registered) + { + before_shmem_exit(do_pg_abort_backup, BoolGetDatum(true)); + exit_handler_registered = true; + } startpoint = do_pg_start_backup(backupidstr, fast, NULL, label_file, NULL, tblspc_map_file, false, true); diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c index 05d02c23f5..44a1b24009 100644 --- a/src/backend/storage/ipc/ipc.c +++ b/src/backend/storage/ipc/ipc.c @@ -389,11 +389,11 @@ on_shmem_exit(pg_on_exit_callback function, Datum arg) void cancel_before_shmem_exit(pg_on_exit_callback function, Datum arg) { - if (before_shmem_exit_index > 0 && - before_shmem_exit_list[before_shmem_exit_index - 1].function - == function && - before_shmem_exit_list[before_shmem_exit_index - 1].arg == arg) - --before_shmem_exit_index; + Assert(before_shmem_exit_index > 0 && + before_shmem_exit_list[before_shmem_exit_index - 1].function + == function && + before_shmem_exit_list[before_shmem_exit_index - 1].arg == arg); + --before_shmem_exit_index; } /* ----------------------------------------------------------------
At Tue, 17 Dec 2019 15:11:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Tue, 17 Dec 2019 11:46:03 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in > PREPARE. The attached does that and changes the if condition of > cancel_before_shmem_exit into assertion. The patch can cause removal of a wrong cleanup function on non-cassert build. That might be unwanted. But I think the assertion is needed anyway. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 17, 2019 at 1:31 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > The patch can cause removal of a wrong cleanup function on non-cassert > build. That might be unwanted. But I think the assertion is needed > anyway. I agree with the first part of this critique, but not necessarily with the second part. Right now, if you call cancel_before_shmem_exit(), you might not remove the handler that you intended to remove, but you won't remove some unrelated handler. With the patch, if assertions are disabled, you will definitely remove something, but it might not be the thing you intended to remove. That seems worse. On the question of whether the assertion is needed, it is currently the case that you could call cancel_before_shmem_exit() without knowing whether you'd actually registered a handler or not. With the proposed change, that would no longer be legal. Maybe that's OK. But it doesn't seem entirely crazy to have some error-recovery path where cancel_before_shmem_exit() could get called twice in obscure circumstances, and right now, you can rest easy, knowing that the second call just won't do anything. If we make it an assertion failure to do such things, then you can't. On the other hand, maybe there's no use for such a construct, and it'd be better to just reduce confusion. Anyway, I think this is a separate topic from fixing this specific problem. Since there doesn't seem to be any opposition to my original fix, except for the fact that I included a bug in it, I'm going to go see about getting that committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 13, 2019 at 3:00 AM Michael Paquier <michael@paquier.xyz> wrote: > Agreed, that's an issue and do_pg_abort_abort should not touch > sessionBackupState, so you should keep cancel_before_shmem_exit after > pg_stop_backup(). I don't understand this comment, because that can't possibly work. It assumes either that nobody else is allowed to use before_shmem_exit() after we do, or that cancel_before_shmem_exit() does something that it doesn't actually do. In general, before_shmem_exit() callbacks are intended to be persistent, and therefore it's the responsibility of the callback to test whether any work needs to be done. This particular callback is an exception, assuming that it can remove itself when there's no longer any work to be done. > Other than that, I have looked in details at how > safe it is to move before_shmem_exit(do_pg_abort_backup) before > do_pg_start_backup() and the cleanups of nonExclusiveBackups happen > safely and consistently in the event of an error during > pg_start_backup(). I came to the same conclusion, but I think it's still better to register the callback first. If the callback is properly written to do nothing when there's nothing to do, then having it registered earlier is harmless. And if, in the future, do_pg_start_backup() should be changed in such a way that, say, it can throw an error at the very end, then registering the handler first would prevent that from being a bug. It is generally more robust to register a cleanup handler in advance and then count on it to do the right thing than to try to write code that isn't allowed to fail in the wrong place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 17, 2019 at 8:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > Since there doesn't seem to be any opposition to my original fix, > except for the fact that I included a bug in it, I'm going to go see > about getting that committed. Perhaps I spoke too soon: I'm not sure whether Michael's comments amount to an objection. While I give him a chance to respond, here's an updated patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > Perhaps I spoke too soon: I'm not sure whether Michael's comments > amount to an objection. While I give him a chance to respond, here's > an updated patch. Took a quick look. I agree that this seems a lot cleaner than the alternative proposals. I'd suggest however that the header comment for do_pg_abort_backup could do with more work, perhaps along the lines of "The odd-looking signature allows this to be registered directly as a shmem_exit handler". Personally I'd have kept the handler as a separate function that's just a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)". We don't usually treat callbacks as functions to be also called in their own right. But if you don't want to do that, I'll settle for an acknowledgement of the hack in the comment. regards, tom lane
On Tue, Dec 17, 2019 at 7:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> Perhaps I spoke too soon: I'm not sure whether Michael's comments
> amount to an objection. While I give him a chance to respond, here's
> an updated patch.
Took a quick look. I agree that this seems a lot cleaner than the
alternative proposals. I'd suggest however that the header comment
for do_pg_abort_backup could do with more work, perhaps along the
lines of "The odd-looking signature allows this to be registered
directly as a shmem_exit handler".
Personally I'd have kept the handler as a separate function that's just
a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)".
We don't usually treat callbacks as functions to be also called in
their own right. But if you don't want to do that, I'll settle for an
acknowledgement of the hack in the comment.
As would I, but I'm also fine with either of the two ways.
I wrote: > Took a quick look. I agree that this seems a lot cleaner than the > alternative proposals. I'd suggest however that the header comment > for do_pg_abort_backup could do with more work, perhaps along the > lines of "The odd-looking signature allows this to be registered > directly as a shmem_exit handler". > Personally I'd have kept the handler as a separate function that's just > a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)". > We don't usually treat callbacks as functions to be also called in > their own right. But if you don't want to do that, I'll settle for an > acknowledgement of the hack in the comment. Oh, scratch that --- looking closer, I see that the only two use-cases in the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP, and both of those require a function with the signature of an on_exit callback. So there's no need for a separate wrapper because this isn't going to be called any other way. I still recommend amending the comment to explain why it has this signature, though. regards, tom lane
On Tue, Dec 17, 2019 at 12:52:05PM -0500, Robert Haas wrote: > On Tue, Dec 17, 2019 at 8:38 AM Robert Haas <robertmhaas@gmail.com> wrote: >> Since there doesn't seem to be any opposition to my original fix, >> except for the fact that I included a bug in it, I'm going to go see >> about getting that committed. > > Perhaps I spoke too soon: I'm not sure whether Michael's comments > amount to an objection. While I give him a chance to respond, here's > an updated patch. stoppoint = do_pg_stop_backup(label_file->data, waitforarchive, NULL); - cancel_before_shmem_exit(nonexclusive_base_backup_cleanup, (Datum) 0); [...] +void +register_persistent_abort_backup_handler(void) +{ + static bool already_done = false; + + if (already_done) + return; So that's how you prevent piling up multiple registrations of this callback compared to v1. FWIW, I think that it is a cleaner approach to remove the callback once a non-exclusive backup is done, because a session has no need for it once it is done with its non-exclusive backup, and this session may remain around for some time. + if (emit_warning) + ereport(WARNING, + (errmsg("aborting backup due to backend exiting before pg_stop_back up was called"))); This warning is incorrect => "pg_stop_back up". (Mentioned upthread.) -- Michael
Attachment
On Tue, Dec 17, 2019 at 6:40 PM Michael Paquier <michael@paquier.xyz> wrote: > So that's how you prevent piling up multiple registrations of this > callback compared to v1. FWIW, I think that it is a cleaner approach > to remove the callback once a non-exclusive backup is done, because a > session has no need for it once it is done with its non-exclusive > backup, and this session may remain around for some time. The fact that the session may remain around for some time isn't really relevant, because the callback isn't consuming any resources. It does not increase memory usage by a single byte, nor CPU consumption either. It does consume a few CPU cycles when the backend finally exits, but the number of such cycles is very small and unrelated to the length of the session. And removing the callback isn't entirely free, either. I think the real point for me is that it's bad to have two sources of truth. We have the sessionBackupState variable that tells us whether we're in a backup, and then we separately have whether or not the callback is registered. If those two things ever get out of sync, as they do in the test cases that I've proposed, then we have problems -- so it's better not to maintain the state in two separate ways. The way it's set up right now actually seems quite fragile even apart from the problem with cancel_before_shmem_exit(). do_pg_stop_backup() sets sessionBackupState to SESSION_BACKUP_NONE and then does things that might fail. If they do, then the cancel_before_shmem_exit() callback will leave the callback installed, which can lead to a spurious warning or assertion failure later as in the original report. The only way to avoid that problem would be to move the cancel_before_shmem_exit() callback so that it happens right next to setting sessionBackupState to SESSION_BACKUP_NONE -- note the comments there saying we can't even do WALInsertLockRelease() between updating XLogCtl and updating sessionBackupState. But that would be very ugly, because we'd then have to pass a flag to do_pg_stop_backup() saying whether to remove the callback, since only one caller wants that behavior. And, we'd have to change cancel_before_shmem_exit() to search the whole array, which would almost certainly cost more cycles than leaving a do-nothing callback around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 18, 2019 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Dec 17, 2019 at 6:40 PM Michael Paquier <michael@paquier.xyz> wrote: > > So that's how you prevent piling up multiple registrations of this > > callback compared to v1. FWIW, I think that it is a cleaner approach > > to remove the callback once a non-exclusive backup is done, because a > > session has no need for it once it is done with its non-exclusive > > backup, and this session may remain around for some time. > > The fact that the session may remain around for some time isn't really > relevant, because the callback isn't consuming any resources. It does > not increase memory usage by a single byte, nor CPU consumption > either. It does consume a few CPU cycles when the backend finally > exits, but the number of such cycles is very small and unrelated to > the length of the session. And removing the callback isn't entirely > free, either. > > I think the real point for me is that it's bad to have two sources of > truth. We have the sessionBackupState variable that tells us whether > we're in a backup, and then we separately have whether or not the > callback is registered. If those two things ever get out of sync, as > they do in the test cases that I've proposed, then we have problems -- > so it's better not to maintain the state in two separate ways. > > The way it's set up right now actually seems quite fragile even apart > from the problem with cancel_before_shmem_exit(). do_pg_stop_backup() > sets sessionBackupState to SESSION_BACKUP_NONE and then does things > that might fail. If they do, then the cancel_before_shmem_exit() > callback will leave the callback installed, which can lead to a > spurious warning or assertion failure later as in the original report. > The only way to avoid that problem would be to move the > cancel_before_shmem_exit() callback so that it happens right next to > setting sessionBackupState to SESSION_BACKUP_NONE -- note the comments > there saying we can't even do WALInsertLockRelease() between updating > XLogCtl and updating sessionBackupState. But that would be very ugly, > because we'd then have to pass a flag to do_pg_stop_backup() saying > whether to remove the callback, since only one caller wants that > behavior. > > And, we'd have to change cancel_before_shmem_exit() to search the > whole array, which would almost certainly cost more cycles than > leaving a do-nothing callback around. If pg_abort_backup callback function can be called safely even when the backup is not in progress, we can just use the global variable like pg_abort_backup_registered to register the callback function only on first call. In this way, cancel_before_shmem_exit() doesn't need to search the array to get rid of the function. Regards, -- Fujii Masao
At Tue, 17 Dec 2019 15:48:40 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in > I wrote: > > Took a quick look. I agree that this seems a lot cleaner than the > > alternative proposals. I'd suggest however that the header comment > > for do_pg_abort_backup could do with more work, perhaps along the > > lines of "The odd-looking signature allows this to be registered > > directly as a shmem_exit handler". > > > Personally I'd have kept the handler as a separate function that's just > > a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)". > > We don't usually treat callbacks as functions to be also called in > > their own right. But if you don't want to do that, I'll settle for an > > acknowledgement of the hack in the comment. > > Oh, scratch that --- looking closer, I see that the only two use-cases in > the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP, > and both of those require a function with the signature of an on_exit > callback. So there's no need for a separate wrapper because this isn't > going to be called any other way. I still recommend amending the > comment to explain why it has this signature, though. The existing comment of do_pg_abort_backup follows seems to be a bit stale to me. I think it can be revised that way. * NB: This is only for aborting a non-exclusive backup that doesn't write * backup_label. A backup started with pg_start_backup() needs to be finished * with pg_stop_backup(). Don't we need a regression test for this behavior? + * Register a handler that will warn about unterminated backups at end of + * session, unless this has already been done. Though I'm not sure of the necessity to do, it might need to mention cleaning up, which is actually done by the function. Other than the above looks good to me. The patch applies cleanly and works as expected. In another branch of this thread, At Tue, 17 Dec 2019 08:38:13 -0500, Robert Haas <robertmhaas@gmail.com> wrote in > On the question of whether the assertion is needed, it is currently > the case that you could call cancel_before_shmem_exit() without > knowing whether you'd actually registered a handler or not. With the > proposed change, that would no longer be legal. Maybe that's OK. But > it doesn't seem entirely crazy to have some error-recovery path where > cancel_before_shmem_exit() could get called twice in obscure > circumstances, and right now, you can rest easy, knowing that the > second call just won't do anything. If we make it an assertion failure > to do such things, then you can't. On the other hand, maybe there's no > use for such a construct, and it'd be better to just reduce confusion. > Anyway, I think this is a separate topic from fixing this specific > problem. The part is not for the case "without knowing whether you'd actually registered a handler or not", but for the case "without knowing whether someone could have registered a handler or not after the registration I made". But there is not so much difference and I don't insist on that because of the reason you mentioned as above. Thanks for elaborating. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 17, 2019 at 11:36 PM Fujii Masao <masao.fujii@gmail.com> wrote: > If pg_abort_backup callback function can be called safely even when > the backup is not in progress, we can just use the global variable like > pg_abort_backup_registered to register the callback function only > on first call. In this way, cancel_before_shmem_exit() doesn't need to > search the array to get rid of the function. Right. That's how the proposed patch works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 18, 2019 at 07:43:43AM -0500, Robert Haas wrote: > On Tue, Dec 17, 2019 at 11:36 PM Fujii Masao <masao.fujii@gmail.com> wrote: >> If pg_abort_backup callback function can be called safely even when >> the backup is not in progress, we can just use the global variable like >> pg_abort_backup_registered to register the callback function only >> on first call. In this way, cancel_before_shmem_exit() doesn't need to >> search the array to get rid of the function. > > Right. That's how the proposed patch works. Well, it seems like I am a poor lonesome cowboy on this one then. And what you are proposing does not break things either as far as I checked, so I'll just go hide in a corner :) -- Michael
Attachment
On Tue, Dec 17, 2019 at 3:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh, scratch that --- looking closer, I see that the only two use-cases in > the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP, > and both of those require a function with the signature of an on_exit > callback. Yeah, that's why I was surprised that you wanted shim functions. > So there's no need for a separate wrapper because this isn't > going to be called any other way. I still recommend amending the > comment to explain why it has this signature, though. Done, and committed. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company