Thread: [BUGS] Backend crash on non-exclusive backup cancel

[BUGS] Backend crash on non-exclusive backup cancel

From
David Steele
Date:
I found this issue while working on a pg_stop_backup() patch.  If a
non-exclusive pg_stop_backup() is cancelled and then attempted again the
backend will crash on assertion:

$ test/pg/bin/psql
psql (10devel)
Type "help" for help.

postgres=# select * from pg_start_backup('label', true, false);
 pg_start_backup
-----------------
 0/2000028
(1 row)

postgres=# select * from pg_stop_backup(false);
^CCancel request sent
ERROR:  canceling statement due to user request
postgres=# select * from pg_stop_backup(false);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

From the server log:

2017-02-28 01:21:34.755 UTC STATEMENT:  select * from pg_stop_backup(false);
TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)",
File: "/postgres/src/backend/access/transam/xlog.c", Line: 10723)

This error was produced in master at 30df93f.  Configure settings are
--enable-cassert --enable-tap-tests --with-openssl.

Disabling assertions "works", but there is still a problem.  A backend
that keeps cancelling pg_stop_backup() without ever resetting the
exclusive flag in xlogfunc.c can decrement the the shared variable
XLogCtl->Insert.nonExclusiveBackups as many times as it wants.  As far
as I can see the worst that will happen is that
XLogCtl->Insert.forcePageWrites won't get set back to false, but that's
still a bug.

This condition should throw "backup is not in progress" just as a
exclusive backup would, whether assertions are enabled or not.

I believe the solution is to move the exclusive flag to xlog.c and only
decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
otherwise return an error.  Even then, it wouldn't be clear if the
backup had completed or not.  I suppose any cancelled non-exclusive
pg_stop_backup() should be considered aborted whether a stop backup
record was written or not?

If that makes sense I'm happy to work up a patch.  This is definitely an
edge case and I seriously doubt it is causing any issues in the field.

-- 
-David
david@pgmasters.net


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Backend crash on non-exclusive backup cancel

From
Michael Paquier
Date:
On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote:
> This condition should throw "backup is not in progress" just as a
> exclusive backup would, whether assertions are enabled or not.
>
> I believe the solution is to move the exclusive flag to xlog.c and only
> decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
> otherwise return an error.  Even then, it wouldn't be clear if the
> backup had completed or not.

I understand by this sentence that you mean
nonexclusive_backup_running, and I think that I agree with that. We
need a way to reset it properly the session-level switch in case of an
interrupt found, and that needs visibly to happen in
pg_stop_backup_callback and pg_start_backup_callback(). At the same
time I think that exclusive_backup_running had better be moved to
xlog.c as well. If I look at the failure in details when issuing a
cancel, I can see that XLogCtl->Insert.nonExclusiveBackups gets
decremented at the end do_pg_stop_backup, but
nonexclusive_backup_running never gets set back to false because of
the query cancellation.

> I suppose any cancelled non-exclusive
> pg_stop_backup() should be considered aborted whether a stop backup
> record was written or not?

That's not necessarily true, I can see a stop backup able to finish as
well by issuing a cancel request. It seems to me that we just need to
have the shmem information updated at the same time as the
session-level switches for consistency and we're good. The
inconsistency in places when updating the session-level flags and the
shmem-level flags is what is causing harm.

> If that makes sense I'm happy to work up a patch.  This is definitely an
> edge case and I seriously doubt it is causing any issues in the field.

Well, at least it is nothing caused directly by 974ece5, I am able to
crash the server with or without that...

There is actually some code that I know of that can issue a cancel
request on pg_stop_backend(), this does not have assertions of course
but it may become an issue. That's pretty close to the improvements I
did recently for lock handling of exclusive backups, so there's
nothing really new for me :)

David, are you willing to write a patch or should I? Changing
nonexclusive_backup_running/exclusive_backup_running to be an enum
would make the code more readable as well IMO.
-- 
Michael


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Backend crash on non-exclusive backup cancel

From
David Steele
Date:
On 2/27/17 10:05 PM, Michael Paquier wrote:
> On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote:
>>
>> I believe the solution is to move the exclusive flag to xlog.c and only
>> decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
>> otherwise return an error.  Even then, it wouldn't be clear if the
>> backup had completed or not.
> 
> I understand by this sentence that you mean
> nonexclusive_backup_running, and I think that I agree with that. 

Whoops!  Yes, I meant the nonexclusive_backup_running flag.

> We
> need a way to reset it properly the session-level switch in case of an
> interrupt found, and that needs visibly to happen in
> pg_stop_backup_callback and pg_start_backup_callback(). At the same
> time I think that exclusive_backup_running had better be moved to
> xlog.c as well. If I look at the failure in details when issuing a
> cancel, 

Agreed.

> I can see that XLogCtl->Insert.nonExclusiveBackups gets
> decremented at the end do_pg_stop_backup, but
> nonexclusive_backup_running never gets set back to false because of
> the query cancellation.

Exactly.

>> I suppose any cancelled non-exclusive
>> pg_stop_backup() should be considered aborted whether a stop backup
>> record was written or not?
> 
> That's not necessarily true, I can see a stop backup able to finish as
> well by issuing a cancel request. It seems to me that we just need to
> have the shmem information updated at the same time as the
> session-level switches for consistency and we're good. The
> inconsistency in places when updating the session-level flags and the
> shmem-level flags is what is causing harm.

I'm sure this could be done but it will require quite a bit of
refactoring and I'm not sure that it's worth it.  In my mind it would be
enough to document that cancelled backups should be considered invalid
whether the stop backup record was written or not.  However, I'm willing
to go with the majority opinion.

>> If that makes sense I'm happy to work up a patch.  This is definitely an
>> edge case and I seriously doubt it is causing any issues in the field.
> 
> <...>
>
> David, are you willing to write a patch or should I? Changing
> nonexclusive_backup_running/exclusive_backup_running to be an enum
> would make the code more readable as well IMO.

I would like to see if anyone else weighs in on this first, but yes I am
planning to write a patch.  Agreed on the enum.

-- 
-David
david@pgmasters.net


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Backend crash on non-exclusive backup cancel

From
Michael Paquier
Date:
On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote:
> On 2/27/17 10:05 PM, Michael Paquier wrote:
>> That's not necessarily true, I can see a stop backup able to finish as
>> well by issuing a cancel request. It seems to me that we just need to
>> have the shmem information updated at the same time as the
>> session-level switches for consistency and we're good. The
>> inconsistency in places when updating the session-level flags and the
>> shmem-level flags is what is causing harm.
>
> I'm sure this could be done but it will require quite a bit of
> refactoring and I'm not sure that it's worth it.  In my mind it would be
> enough to document that cancelled backups should be considered invalid
> whether the stop backup record was written or not.  However, I'm willing
> to go with the majority opinion.

I would think that addressing the problem is the way to go, hitting an
assertion in this code path means that we are doing something wrong so
simply removing it does not sound like a correct answer to me.

>>> If that makes sense I'm happy to work up a patch.  This is definitely an
>>> edge case and I seriously doubt it is causing any issues in the field.
>>
>> <...>
>>
>> David, are you willing to write a patch or should I? Changing
>> nonexclusive_backup_running/exclusive_backup_running to be an enum
>> would make the code more readable as well IMO.
>
> I would like to see if anyone else weighs in on this first, but yes I am
> planning to write a patch.  Agreed on the enum.

This was itching me yesterday so I wrote a patch able to address this
problem... At least it can be registered to the CF on time and give it
more visibility. By the way, I have noticed a second bug in the
current logic while looking at the problem you have reported:
1) Start backup in session 1:
=# select pg_start_backup('popo');
 pg_start_backup
-----------------
 0/2000028
(1 row)
2) pg_stop_backup() in session 2
3) And now when trying to work on backups with session 1:
=# select pg_start_backup('popo');
ERROR:  55000: a backup is already in progress in this session
LOCATION:  pg_start_backup, xlogfuncs.c:87
=# select pg_stop_backup();
ERROR:  55000: exclusive backup not in progress
LOCATION:  do_pg_stop_backup, xlog.c:10642

So the handling around exclusive_backup_running is broken now as it is
possible to lock a session easily when handling backups. And the root
of the problem is that checks on exclusive_backup_running are not
necessary. I have fixed as well this problem as per the attached. Note
however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
attached when doing checks with pg_stop_backup_v2() for a
non-exclusive backup run. It is also useful for debugging.
-- 
Michael

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

Re: [BUGS] Backend crash on non-exclusive backup cancel

From
David Steele
Date:
On 2/28/17 9:08 PM, Michael Paquier wrote:
> On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote:
>> On 2/27/17 10:05 PM, Michael Paquier wrote:
>>> That's not necessarily true, I can see a stop backup able to finish as
>>> well by issuing a cancel request. It seems to me that we just need to
>>> have the shmem information updated at the same time as the
>>> session-level switches for consistency and we're good. The
>>> inconsistency in places when updating the session-level flags and the
>>> shmem-level flags is what is causing harm.
>>
>> I'm sure this could be done but it will require quite a bit of
>> refactoring and I'm not sure that it's worth it.  In my mind it would be
>> enough to document that cancelled backups should be considered invalid
>> whether the stop backup record was written or not.  However, I'm willing
>> to go with the majority opinion.
> 
> I would think that addressing the problem is the way to go, hitting an
> assertion in this code path means that we are doing something wrong so
> simply removing it does not sound like a correct answer to me.
> 
>>>> If that makes sense I'm happy to work up a patch.  This is definitely an
>>>> edge case and I seriously doubt it is causing any issues in the field.
>>>
>>> <...>
>>>
>>> David, are you willing to write a patch or should I? Changing
>>> nonexclusive_backup_running/exclusive_backup_running to be an enum
>>> would make the code more readable as well IMO.
>>
>> I would like to see if anyone else weighs in on this first, but yes I am
>> planning to write a patch.  Agreed on the enum.
> 
> This was itching me yesterday so I wrote a patch able to address this
> problem... 

I know that feeling.

> At least it can be registered to the CF on time and give it
> more visibility. 

Indeed.

> By the way, I have noticed a second bug in the
> current logic while looking at the problem you have reported:
> 1) Start backup in session 1:
> =# select pg_start_backup('popo');
>  pg_start_backup
> -----------------
>  0/2000028
> (1 row)
> 2) pg_stop_backup() in session 2
> 3) And now when trying to work on backups with session 1:
> =# select pg_start_backup('popo');
> ERROR:  55000: a backup is already in progress in this session
> LOCATION:  pg_start_backup, xlogfuncs.c:87
> =# select pg_stop_backup();
> ERROR:  55000: exclusive backup not in progress
> LOCATION:  do_pg_stop_backup, xlog.c:10642

Yeah, that doesn't look good.

> So the handling around exclusive_backup_running is broken now as it is
> possible to lock a session easily when handling backups. And the root
> of the problem is that checks on exclusive_backup_running are not
> necessary. I have fixed as well this problem as per the attached. Note
> however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
> attached when doing checks with pg_stop_backup_v2() for a
> non-exclusive backup run. It is also useful for debugging.

Excellent! I'll have a look, and thanks for working up a patch.

-- 
-David
david@pgmasters.net


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs