Thread: non-exclusive backup cleanup is mildly broken

non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
While reviewing the first patch in Asif Rehman's series of patches for
parallel backup over at
http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgpLrq4cBQE5ZviUxygKq3VqiA@mail.gmail.com
I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694
introduced a use of cancel_before_shmem_exit which falsifies the
comment for that function. So you can cause a spurious WARNING in the
logs by doing something like this, with max_prepared_transactions set
to a non-zero value:

select pg_start_backup('one', false, false);
begin;
prepare transaction 'nothing';
select pg_stop_backup(false);
\q

in the server log:
WARNING:  aborting backup due to backend exiting before pg_stop_backup
was called

And you can cause an assertion failure like this:

select pg_start_backup('one', false, false);
begin;
prepare transaction 'nothing';
select pg_stop_backup(false);
select pg_start_backup('two');
\q

We've discussed before the idea that it might be good to remove the
limitation that before_shmem_exit() can only remove the
most-recently-added callback, which would be one way to fix this
problem, but I'd like to propose an alternative solution which I think
will work out more nicely for the patch mentioned above: don't use
cancel_before_shmem_exit, and just leave the callback registered for
the lifetime of the backend. That requires some adjustment of the
callback, since it needs to tolerate exclusive backup mode being in
effect.

The attached patch takes that approach. Thoughts welcome on (1) the
approach and (2) whether to back-patch. I think there's little doubt
that this is formally a bug, but it's a pretty minor one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: non-exclusive backup cleanup is mildly broken

From
Kyotaro Horiguchi
Date:
Hello.

At Wed, 11 Dec 2019 17:32:05 -0500, Robert Haas <robertmhaas@gmail.com> wrote in 
> While reviewing the first patch in Asif Rehman's series of patches for
> parallel backup over at
> http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgpLrq4cBQE5ZviUxygKq3VqiA@mail.gmail.com
> I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694
> introduced a use of cancel_before_shmem_exit which falsifies the
> comment for that function. So you can cause a spurious WARNING in the
> logs by doing something like this, with max_prepared_transactions set
> to a non-zero value:
> 
> select pg_start_backup('one', false, false);
> begin;
> prepare transaction 'nothing';
> select pg_stop_backup(false);
> \q
> 
> in the server log:
> WARNING:  aborting backup due to backend exiting before pg_stop_backup
> was called
> 
> And you can cause an assertion failure like this:
> 
> select pg_start_backup('one', false, false);
> begin;
> prepare transaction 'nothing';
> select pg_stop_backup(false);
> select pg_start_backup('two');
> \q
> 
> We've discussed before the idea that it might be good to remove the
> limitation that before_shmem_exit() can only remove the
> most-recently-added callback, which would be one way to fix this
> problem, but I'd like to propose an alternative solution which I think
> will work out more nicely for the patch mentioned above: don't use
> cancel_before_shmem_exit, and just leave the callback registered for
> the lifetime of the backend. That requires some adjustment of the
> callback, since it needs to tolerate exclusive backup mode being in
> effect.
> 
> The attached patch takes that approach. Thoughts welcome on (1) the
> approach and (2) whether to back-patch. I think there's little doubt
> that this is formally a bug, but it's a pretty minor one.

The direction seems reasonable, but the patch doesn't free up the
before_shmem_exec slot nor avoid duplicate registration of the
callback. Actually before_shmem_exit_list gets bloat with multiple
do_pg_abort_backup entries through repeated rounds of non-exclusive
backups.

As the result, if one ends a session while a non-exclusive backup is
active after closing the previous non-exclusive backup,
do_pg_abort_backup aborts for assertion failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: non-exclusive backup cleanup is mildly broken

From
Magnus Hagander
Date:
On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Hello.

At Wed, 11 Dec 2019 17:32:05 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
> While reviewing the first patch in Asif Rehman's series of patches for
> parallel backup over at
> http://postgr.es/m/CADM=Jeg3ZN+kPQpiSfeWCXr=xgpLrq4cBQE5ZviUxygKq3VqiA@mail.gmail.com
> I discovered that commit 7117685461af50f50c03f43e6a622284c8d54694
> introduced a use of cancel_before_shmem_exit which falsifies the
> comment for that function. So you can cause a spurious WARNING in the
> logs by doing something like this, with max_prepared_transactions set
> to a non-zero value:
>
> select pg_start_backup('one', false, false);
> begin;
> prepare transaction 'nothing';
> select pg_stop_backup(false);
> \q
>
> in the server log:
> WARNING:  aborting backup due to backend exiting before pg_stop_backup
> was called
>
> And you can cause an assertion failure like this:
>
> select pg_start_backup('one', false, false);
> begin;
> prepare transaction 'nothing';
> select pg_stop_backup(false);
> select pg_start_backup('two');
> \q
>
> We've discussed before the idea that it might be good to remove the
> limitation that before_shmem_exit() can only remove the
> most-recently-added callback, which would be one way to fix this
> problem, but I'd like to propose an alternative solution which I think
> will work out more nicely for the patch mentioned above: don't use
> cancel_before_shmem_exit, and just leave the callback registered for
> the lifetime of the backend. That requires some adjustment of the
> callback, since it needs to tolerate exclusive backup mode being in
> effect.
>
> The attached patch takes that approach. Thoughts welcome on (1) the
> approach and (2) whether to back-patch. I think there's little doubt
> that this is formally a bug, but it's a pretty minor one.

The direction seems reasonable, but the patch doesn't free up the
before_shmem_exec slot nor avoid duplicate registration of the
callback. Actually before_shmem_exit_list gets bloat with multiple
do_pg_abort_backup entries through repeated rounds of non-exclusive
backups.

As the result, if one ends a session while a non-exclusive backup is
active after closing the previous non-exclusive backup,
do_pg_abort_backup aborts for assertion failure.

+1, I also think the direction seems perfectly reasonable, but we should avoid re-adding the callback since we're not removing it. Leaving it around seems cheap enough as long as there is only one.

My first reaction would be to just disallow the combination of prepared transactions and start/stop backups. But lookingat it it seems like an unnecessray restriction and an approach like this one seems better. 

--

Re: non-exclusive backup cleanup is mildly broken

From
Michael Paquier
Date:
On Thu, Dec 12, 2019 at 01:52:31PM +0100, Magnus Hagander wrote:
> On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
>> The direction seems reasonable, but the patch doesn't free up the
>> before_shmem_exec slot nor avoid duplicate registration of the
>> callback. Actually before_shmem_exit_list gets bloat with multiple
>> do_pg_abort_backup entries through repeated rounds of non-exclusive
>> backups.
>>
>> As the result, if one ends a session while a non-exclusive backup is
>> active after closing the previous non-exclusive backup,
>> do_pg_abort_backup aborts for assertion failure.

Agreed, that's an issue and do_pg_abort_abort should not touch
sessionBackupState, so you should keep cancel_before_shmem_exit after
pg_stop_backup().  Other than that, I have looked in details at how
safe it is to move before_shmem_exit(do_pg_abort_backup) before
do_pg_start_backup() and the cleanups of nonExclusiveBackups happen
safely and consistently in the event of an error during
pg_start_backup().

> +1, I also think the direction seems perfectly reasonable, but we should
> avoid re-adding the callback since we're not removing it. Leaving it around
> seems cheap enough as long as there is only one.

+     (errmsg("aborting backup due to backend exiting before
pg_stop_back up was called")));
Not sure that pg_stop_back exists ;p

> My first reaction would be to just disallow the combination of prepared
> transactions and start/stop backups. But looking at it it seems like an
> unnecessary restriction and an approach like this one seems better.

I think that's a bad idea to put a restriction of this kind.  There
are large consumers of 2PC, and everybody needs backups.
--
Michael

Attachment

Re: non-exclusive backup cleanup is mildly broken

From
Magnus Hagander
Date:


On Fri, Dec 13, 2019 at 9:00 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 12, 2019 at 01:52:31PM +0100, Magnus Hagander wrote:
> On Thu, Dec 12, 2019 at 5:58 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:

> My first reaction would be to just disallow the combination of prepared
> transactions and start/stop backups. But looking at it it seems like an
> unnecessary restriction and an approach like this one seems better.

I think that's a bad idea to put a restriction of this kind.  There
are large consumers of 2PC, and everybody needs backups.

You misunderstood me. I certainly didn't mean that people who use 2PC shouldn't be able to use proper backups -- that would be *terrible*.

I meant disallowing pg_start_backup() in a session that had a prepared transaction, and disallowing preparing a transaction in a session with an ongoing backup. They would still work perfectly fine in *other* parallel sessions.

That said, being able to do it in the session itself is of course even better. 

--

Re: non-exclusive backup cleanup is mildly broken

From
David Steele
Date:
On 12/13/19 3:56 AM, Magnus Hagander wrote:
> On Fri, Dec 13, 2019 at 9:00 AM Michael Paquier <michael@paquier.xyz 
> 
>     I think that's a bad idea to put a restriction of this kind.  There
>     are large consumers of 2PC, and everybody needs backups.
> 
> 
> You misunderstood me. I certainly didn't mean that people who use 2PC 
> shouldn't be able to use proper backups -- that would be *terrible*.
> 
> I meant disallowing pg_start_backup() in a session that had a prepared 
> transaction, and disallowing preparing a transaction in a session with 
> an ongoing backup. They would still work perfectly fine in *other* 
> parallel sessions.

+1.  I think it is reasonable to expect pg_start/stop_backup() to be 
performed in its own session without prepared transactions.

+more if this concession keeps other aspects of the code simpler.

-- 
-David
david@pgmasters.net



Re: non-exclusive backup cleanup is mildly broken

From
Kyotaro Horiguchi
Date:
At Fri, 13 Dec 2019 18:50:25 -0500, David Steele <david@pgmasters.net> wrote in
> On 12/13/19 3:56 AM, Magnus Hagander wrote:
> > On Fri, Dec 13, 2019 at 9:00 AM Michael Paquier <michael@paquier.xyz I
> > think that's a bad idea to put a restriction of this kind.  There
> >     are large consumers of 2PC, and everybody needs backups.
> > You misunderstood me. I certainly didn't mean that people who use 2PC
> > shouldn't be able to use proper backups -- that would be *terrible*.
> > I meant disallowing pg_start_backup() in a session that had a prepared
> > transaction, and disallowing preparing a transaction in a session with
> > an ongoing backup. They would still work perfectly fine in *other*
> > parallel sessions.
>
> +1.  I think it is reasonable to expect pg_start/stop_backup() to be
> performed in its own session without prepared transactions.
>
> +more if this concession keeps other aspects of the code simpler.

However I don't object to the restriction, couldn't we allow the
cancel_before_shmem_exit to search for the given entry looping over
the before_shmem_exit array? If we don't do that, an assrtion is needed
instead.

Since pg_stop_backup_v2 is the only caller to the function in the
whole server code, making cancel_before_shmem_exit a bit wiser (and
slower) cannot hurt anyone.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Sun, Dec 15, 2019 at 8:44 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> However I don't object to the restriction, couldn't we allow the
> cancel_before_shmem_exit to search for the given entry looping over
> the before_shmem_exit array? If we don't do that, an assrtion is needed
> instead.
>
> Since pg_stop_backup_v2 is the only caller to the function in the
> whole server code, making cancel_before_shmem_exit a bit wiser (and
> slower) cannot hurt anyone.

That's actually not true. It's called from
PG_END_ENSURE_ERROR_CLEANUP. Still, it wouldn't cost a lot to fix this
that way. However, I think that it's better to fix it the other way,
as I mentioned in my original email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: non-exclusive backup cleanup is mildly broken

From
Fujii Masao
Date:
On Tue, Dec 17, 2019 at 4:19 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Dec 15, 2019 at 8:44 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > However I don't object to the restriction, couldn't we allow the
> > cancel_before_shmem_exit to search for the given entry looping over
> > the before_shmem_exit array? If we don't do that, an assrtion is needed
> > instead.
> >
> > Since pg_stop_backup_v2 is the only caller to the function in the
> > whole server code, making cancel_before_shmem_exit a bit wiser (and
> > slower) cannot hurt anyone.
>
> That's actually not true. It's called from
> PG_END_ENSURE_ERROR_CLEANUP. Still, it wouldn't cost a lot to fix this
> that way. However, I think that it's better to fix it the other way,
> as I mentioned in my original email.

+1

Not only PREPARE but also other commands that we may add in the future
can cause the same issue, so it's better to address the root cause rather
than working around by disallowing PREPARE.

Regards,

-- 
Fujii Masao



Re: non-exclusive backup cleanup is mildly broken

From
Kyotaro Horiguchi
Date:
At Tue, 17 Dec 2019 11:46:03 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in 
> On Tue, Dec 17, 2019 at 4:19 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sun, Dec 15, 2019 at 8:44 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > However I don't object to the restriction, couldn't we allow the
> > > cancel_before_shmem_exit to search for the given entry looping over
> > > the before_shmem_exit array? If we don't do that, an assrtion is needed
> > > instead.
> > >
> > > Since pg_stop_backup_v2 is the only caller to the function in the
> > > whole server code, making cancel_before_shmem_exit a bit wiser (and
> > > slower) cannot hurt anyone.
> >
> > That's actually not true. It's called from
> > PG_END_ENSURE_ERROR_CLEANUP. Still, it wouldn't cost a lot to fix this
> > that way. However, I think that it's better to fix it the other way,
> > as I mentioned in my original email.

Sorry. I knew that.

> +1
> 
> Not only PREPARE but also other commands that we may add in the future
> can cause the same issue, so it's better to address the root cause rather
> than working around by disallowing PREPARE.

I stand on that side. I'm not sure what we are thinking as the root
cause, but PREPARE is avoiding duplicate registration using the static
bool twophaseExitRegistered and the most reasonable way to fix the
crash of the current patch would be to do the same thing as
PREPARE. The attached does that and changes the if condition of
cancel_before_shmem_exit into assertion.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index cac5854fcf..6f13084c43 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -80,6 +80,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
     }
     else
     {
+        static bool exit_handler_registered = false;
         MemoryContext oldcontext;
 
         /*
@@ -91,7 +92,12 @@ pg_start_backup(PG_FUNCTION_ARGS)
         tblspc_map_file = makeStringInfo();
         MemoryContextSwitchTo(oldcontext);
 
-        before_shmem_exit(do_pg_abort_backup, BoolGetDatum(true));
+        /* We don't remove this handler so register it only once */
+        if (!exit_handler_registered)
+        {
+            before_shmem_exit(do_pg_abort_backup, BoolGetDatum(true));
+            exit_handler_registered = true;
+        }
 
         startpoint = do_pg_start_backup(backupidstr, fast, NULL, label_file,
                                         NULL, tblspc_map_file, false, true);
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 05d02c23f5..44a1b24009 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -389,11 +389,11 @@ on_shmem_exit(pg_on_exit_callback function, Datum arg)
 void
 cancel_before_shmem_exit(pg_on_exit_callback function, Datum arg)
 {
-    if (before_shmem_exit_index > 0 &&
-        before_shmem_exit_list[before_shmem_exit_index - 1].function
-        == function &&
-        before_shmem_exit_list[before_shmem_exit_index - 1].arg == arg)
-        --before_shmem_exit_index;
+    Assert(before_shmem_exit_index > 0 &&
+           before_shmem_exit_list[before_shmem_exit_index - 1].function
+           == function &&
+           before_shmem_exit_list[before_shmem_exit_index - 1].arg == arg);
+    --before_shmem_exit_index;
 }
 
 /* ----------------------------------------------------------------

Re: non-exclusive backup cleanup is mildly broken

From
Kyotaro Horiguchi
Date:
At Tue, 17 Dec 2019 15:11:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Tue, 17 Dec 2019 11:46:03 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in 
> PREPARE. The attached does that and changes the if condition of
> cancel_before_shmem_exit into assertion.

The patch can cause removal of a wrong cleanup function on non-cassert
build. That might be unwanted. But I think the assertion is needed
anyway.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Tue, Dec 17, 2019 at 1:31 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> The patch can cause removal of a wrong cleanup function on non-cassert
> build. That might be unwanted. But I think the assertion is needed
> anyway.

I agree with the first part of this critique, but not necessarily with
the second part. Right now, if you call cancel_before_shmem_exit(),
you might not remove the handler that you intended to remove, but you
won't remove some unrelated handler. With the patch, if assertions are
disabled, you will definitely remove something, but it might not be
the thing you intended to remove. That seems worse.

On the question of whether the assertion is needed, it is currently
the case that you could call cancel_before_shmem_exit() without
knowing whether you'd actually registered a handler or not. With the
proposed change, that would no longer be legal. Maybe that's OK. But
it doesn't seem entirely crazy to have some error-recovery path where
cancel_before_shmem_exit() could get called twice in obscure
circumstances, and right now, you can rest easy, knowing that the
second call just won't do anything. If we make it an assertion failure
to do such things, then you can't. On the other hand, maybe there's no
use for such a construct, and it'd be better to just reduce confusion.
Anyway, I think this is a separate topic from fixing this specific
problem.

Since there doesn't seem to be any opposition to my original fix,
except for the fact that I included a bug in it, I'm going to go see
about getting that committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Fri, Dec 13, 2019 at 3:00 AM Michael Paquier <michael@paquier.xyz> wrote:
> Agreed, that's an issue and do_pg_abort_abort should not touch
> sessionBackupState, so you should keep cancel_before_shmem_exit after
> pg_stop_backup().

I don't understand this comment, because that can't possibly work.  It
assumes either that nobody else is allowed to use before_shmem_exit()
after we do, or that cancel_before_shmem_exit() does something that it
doesn't actually do.

In general, before_shmem_exit() callbacks are intended to be
persistent, and therefore it's the responsibility of the callback to
test whether any work needs to be done. This particular callback is an
exception, assuming that it can remove itself when there's no longer
any work to be done.

> Other than that, I have looked in details at how
> safe it is to move before_shmem_exit(do_pg_abort_backup) before
> do_pg_start_backup() and the cleanups of nonExclusiveBackups happen
> safely and consistently in the event of an error during
> pg_start_backup().

I came to the same conclusion, but I think it's still better to
register the callback first. If the callback is properly written to do
nothing when there's nothing to do, then having it registered earlier
is harmless. And if, in the future, do_pg_start_backup() should be
changed in such a way that, say, it can throw an error at the very
end, then registering the handler first would prevent that from being
a bug.

It is generally more robust to register a cleanup handler in advance
and then count on it to do the right thing than to try to write code
that isn't allowed to fail in the wrong place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Tue, Dec 17, 2019 at 8:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Since there doesn't seem to be any opposition to my original fix,
> except for the fact that I included a bug in it, I'm going to go see
> about getting that committed.

Perhaps I spoke too soon: I'm not sure whether Michael's comments
amount to an objection. While I give him a chance to respond, here's
an updated patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: non-exclusive backup cleanup is mildly broken

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Perhaps I spoke too soon: I'm not sure whether Michael's comments
> amount to an objection. While I give him a chance to respond, here's
> an updated patch.

Took a quick look.  I agree that this seems a lot cleaner than the
alternative proposals.  I'd suggest however that the header comment
for do_pg_abort_backup could do with more work, perhaps along the
lines of "The odd-looking signature allows this to be registered
directly as a shmem_exit handler".

Personally I'd have kept the handler as a separate function that's just
a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)".
We don't usually treat callbacks as functions to be also called in
their own right.  But if you don't want to do that, I'll settle for an
acknowledgement of the hack in the comment.

            regards, tom lane



Re: non-exclusive backup cleanup is mildly broken

From
Magnus Hagander
Date:

On Tue, Dec 17, 2019 at 7:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> Perhaps I spoke too soon: I'm not sure whether Michael's comments
> amount to an objection. While I give him a chance to respond, here's
> an updated patch.

Took a quick look.  I agree that this seems a lot cleaner than the
alternative proposals.  I'd suggest however that the header comment
for do_pg_abort_backup could do with more work, perhaps along the
lines of "The odd-looking signature allows this to be registered
directly as a shmem_exit handler".

Personally I'd have kept the handler as a separate function that's just
a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)".
We don't usually treat callbacks as functions to be also called in
their own right.  But if you don't want to do that, I'll settle for an
acknowledgement of the hack in the comment.

As would I, but I'm also fine with either of the two ways.
 
--

Re: non-exclusive backup cleanup is mildly broken

From
Tom Lane
Date:
I wrote:
> Took a quick look.  I agree that this seems a lot cleaner than the
> alternative proposals.  I'd suggest however that the header comment
> for do_pg_abort_backup could do with more work, perhaps along the
> lines of "The odd-looking signature allows this to be registered
> directly as a shmem_exit handler".

> Personally I'd have kept the handler as a separate function that's just
> a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)".
> We don't usually treat callbacks as functions to be also called in
> their own right.  But if you don't want to do that, I'll settle for an
> acknowledgement of the hack in the comment.

Oh, scratch that --- looking closer, I see that the only two use-cases in
the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP,
and both of those require a function with the signature of an on_exit
callback.  So there's no need for a separate wrapper because this isn't
going to be called any other way.  I still recommend amending the
comment to explain why it has this signature, though.

            regards, tom lane



Re: non-exclusive backup cleanup is mildly broken

From
Michael Paquier
Date:
On Tue, Dec 17, 2019 at 12:52:05PM -0500, Robert Haas wrote:
> On Tue, Dec 17, 2019 at 8:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Since there doesn't seem to be any opposition to my original fix,
>> except for the fact that I included a bug in it, I'm going to go see
>> about getting that committed.
>
> Perhaps I spoke too soon: I'm not sure whether Michael's comments
> amount to an objection. While I give him a chance to respond, here's
> an updated patch.

    stoppoint = do_pg_stop_backup(label_file->data, waitforarchive, NULL);
-   cancel_before_shmem_exit(nonexclusive_base_backup_cleanup, (Datum) 0);
[...]
+void
+register_persistent_abort_backup_handler(void)
+{
+   static bool already_done = false;
+
+   if (already_done)
+       return;
So that's how you prevent piling up multiple registrations of this
callback compared to v1.  FWIW, I think that it is a cleaner approach
to remove the callback once a non-exclusive backup is done, because a
session has no need for it once it is done with its non-exclusive
backup, and this session may remain around for some time.

+   if (emit_warning)
+       ereport(WARNING,
+               (errmsg("aborting backup due to backend exiting before
pg_stop_back up was called")));
This warning is incorrect => "pg_stop_back up".  (Mentioned upthread.)
--
Michael

Attachment

Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Tue, Dec 17, 2019 at 6:40 PM Michael Paquier <michael@paquier.xyz> wrote:
> So that's how you prevent piling up multiple registrations of this
> callback compared to v1.  FWIW, I think that it is a cleaner approach
> to remove the callback once a non-exclusive backup is done, because a
> session has no need for it once it is done with its non-exclusive
> backup, and this session may remain around for some time.

The fact that the session may remain around for some time isn't really
relevant, because the callback isn't consuming any resources. It does
not increase memory usage by a single byte, nor CPU consumption
either. It does consume a few CPU cycles when the backend finally
exits, but the number of such cycles is very small and unrelated to
the length of the session. And removing the callback isn't entirely
free, either.

I think the real point for me is that it's bad to have two sources of
truth. We have the sessionBackupState variable that tells us whether
we're in a backup, and then we separately have whether or not the
callback is registered. If those two things ever get out of sync, as
they do in the test cases that I've proposed, then we have problems --
so it's better not to maintain the state in two separate ways.

The way it's set up right now actually seems quite fragile even apart
from the problem with cancel_before_shmem_exit(). do_pg_stop_backup()
sets sessionBackupState to SESSION_BACKUP_NONE and then does things
that might fail. If they do, then the cancel_before_shmem_exit()
callback will leave the callback installed, which can lead to a
spurious warning or assertion failure later as in the original report.
The only way to avoid that problem would be to move the
cancel_before_shmem_exit() callback so that it happens right next to
setting sessionBackupState to SESSION_BACKUP_NONE -- note the comments
there saying we can't even do WALInsertLockRelease() between updating
XLogCtl and updating sessionBackupState. But that would be very ugly,
because we'd then have to pass a flag to do_pg_stop_backup() saying
whether to remove the callback, since only one caller wants that
behavior.

And, we'd have to change cancel_before_shmem_exit() to search the
whole array, which would almost certainly cost more cycles than
leaving a do-nothing callback around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: non-exclusive backup cleanup is mildly broken

From
Fujii Masao
Date:
On Wed, Dec 18, 2019 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Dec 17, 2019 at 6:40 PM Michael Paquier <michael@paquier.xyz> wrote:
> > So that's how you prevent piling up multiple registrations of this
> > callback compared to v1.  FWIW, I think that it is a cleaner approach
> > to remove the callback once a non-exclusive backup is done, because a
> > session has no need for it once it is done with its non-exclusive
> > backup, and this session may remain around for some time.
>
> The fact that the session may remain around for some time isn't really
> relevant, because the callback isn't consuming any resources. It does
> not increase memory usage by a single byte, nor CPU consumption
> either. It does consume a few CPU cycles when the backend finally
> exits, but the number of such cycles is very small and unrelated to
> the length of the session. And removing the callback isn't entirely
> free, either.
>
> I think the real point for me is that it's bad to have two sources of
> truth. We have the sessionBackupState variable that tells us whether
> we're in a backup, and then we separately have whether or not the
> callback is registered. If those two things ever get out of sync, as
> they do in the test cases that I've proposed, then we have problems --
> so it's better not to maintain the state in two separate ways.
>
> The way it's set up right now actually seems quite fragile even apart
> from the problem with cancel_before_shmem_exit(). do_pg_stop_backup()
> sets sessionBackupState to SESSION_BACKUP_NONE and then does things
> that might fail. If they do, then the cancel_before_shmem_exit()
> callback will leave the callback installed, which can lead to a
> spurious warning or assertion failure later as in the original report.
> The only way to avoid that problem would be to move the
> cancel_before_shmem_exit() callback so that it happens right next to
> setting sessionBackupState to SESSION_BACKUP_NONE -- note the comments
> there saying we can't even do WALInsertLockRelease() between updating
> XLogCtl and updating sessionBackupState. But that would be very ugly,
> because we'd then have to pass a flag to do_pg_stop_backup() saying
> whether to remove the callback, since only one caller wants that
> behavior.
>
> And, we'd have to change cancel_before_shmem_exit() to search the
> whole array, which would almost certainly cost more cycles than
> leaving a do-nothing callback around.

If pg_abort_backup callback function can be called safely even when
the backup is not in progress, we can just use the global variable like
pg_abort_backup_registered to register the callback function only
on first call. In this way, cancel_before_shmem_exit() doesn't need to
search the array to get rid of the function.

Regards,

-- 
Fujii Masao



Re: non-exclusive backup cleanup is mildly broken

From
Kyotaro Horiguchi
Date:
At Tue, 17 Dec 2019 15:48:40 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> I wrote:
> > Took a quick look.  I agree that this seems a lot cleaner than the
> > alternative proposals.  I'd suggest however that the header comment
> > for do_pg_abort_backup could do with more work, perhaps along the
> > lines of "The odd-looking signature allows this to be registered
> > directly as a shmem_exit handler".
> 
> > Personally I'd have kept the handler as a separate function that's just
> > a one-line wrapper around "void do_pg_abort_backup(bool emit_warning)".
> > We don't usually treat callbacks as functions to be also called in
> > their own right.  But if you don't want to do that, I'll settle for an
> > acknowledgement of the hack in the comment.
> 
> Oh, scratch that --- looking closer, I see that the only two use-cases in
> the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP,
> and both of those require a function with the signature of an on_exit
> callback.  So there's no need for a separate wrapper because this isn't
> going to be called any other way.  I still recommend amending the
> comment to explain why it has this signature, though.

The existing comment of do_pg_abort_backup follows seems to be a bit
stale to me. I think it can be revised that way.

  * NB: This is only for aborting a non-exclusive backup that doesn't write
  * backup_label. A backup started with pg_start_backup() needs to be finished
  * with pg_stop_backup().


Don't we need a regression test for this behavior?


+ * Register a handler that will warn about unterminated backups at end of
+ * session, unless this has already been done.

Though I'm not sure of the necessity to do, it might need to mention
cleaning up, which is actually done by the function.


Other than the above looks good to me. The patch applies cleanly and
works as expected.


In another branch of this thread,

At Tue, 17 Dec 2019 08:38:13 -0500, Robert Haas <robertmhaas@gmail.com> wrote in 
> On the question of whether the assertion is needed, it is currently
> the case that you could call cancel_before_shmem_exit() without
> knowing whether you'd actually registered a handler or not. With the
> proposed change, that would no longer be legal. Maybe that's OK. But
> it doesn't seem entirely crazy to have some error-recovery path where
> cancel_before_shmem_exit() could get called twice in obscure
> circumstances, and right now, you can rest easy, knowing that the
> second call just won't do anything. If we make it an assertion failure
> to do such things, then you can't. On the other hand, maybe there's no
> use for such a construct, and it'd be better to just reduce confusion.
> Anyway, I think this is a separate topic from fixing this specific
> problem.

The part is not for the case "without knowing whether you'd actually
registered a handler or not", but for the case "without knowing
whether someone could have registered a handler or not after the
registration I made". But there is not so much difference and I don't
insist on that because of the reason you mentioned as above. Thanks
for elaborating.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Tue, Dec 17, 2019 at 11:36 PM Fujii Masao <masao.fujii@gmail.com> wrote:
> If pg_abort_backup callback function can be called safely even when
> the backup is not in progress, we can just use the global variable like
> pg_abort_backup_registered to register the callback function only
> on first call. In this way, cancel_before_shmem_exit() doesn't need to
> search the array to get rid of the function.

Right. That's how the proposed patch works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: non-exclusive backup cleanup is mildly broken

From
Michael Paquier
Date:
On Wed, Dec 18, 2019 at 07:43:43AM -0500, Robert Haas wrote:
> On Tue, Dec 17, 2019 at 11:36 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>> If pg_abort_backup callback function can be called safely even when
>> the backup is not in progress, we can just use the global variable like
>> pg_abort_backup_registered to register the callback function only
>> on first call. In this way, cancel_before_shmem_exit() doesn't need to
>> search the array to get rid of the function.
>
> Right. That's how the proposed patch works.

Well, it seems like I am a poor lonesome cowboy on this one then.
And what you are proposing does not break things either as far as I
checked, so I'll just go hide in a corner :)
--
Michael

Attachment

Re: non-exclusive backup cleanup is mildly broken

From
Robert Haas
Date:
On Tue, Dec 17, 2019 at 3:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh, scratch that --- looking closer, I see that the only two use-cases in
> the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP,
> and both of those require a function with the signature of an on_exit
> callback.

Yeah, that's why I was surprised that you wanted shim functions.

> So there's no need for a separate wrapper because this isn't
> going to be called any other way.  I still recommend amending the
> comment to explain why it has this signature, though.

Done, and committed.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company