Thread: [BUGS] Backend crash on non-exclusive backup cancel
I found this issue while working on a pg_stop_backup() patch. If a non-exclusive pg_stop_backup() is cancelled and then attempted again the backend will crash on assertion: $ test/pg/bin/psql psql (10devel) Type "help" for help. postgres=# select * from pg_start_backup('label', true, false); pg_start_backup ----------------- 0/2000028 (1 row) postgres=# select * from pg_stop_backup(false); ^CCancel request sent ERROR: canceling statement due to user request postgres=# select * from pg_stop_backup(false); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> \q From the server log: 2017-02-28 01:21:34.755 UTC STATEMENT: select * from pg_stop_backup(false); TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)", File: "/postgres/src/backend/access/transam/xlog.c", Line: 10723) This error was produced in master at 30df93f. Configure settings are --enable-cassert --enable-tap-tests --with-openssl. Disabling assertions "works", but there is still a problem. A backend that keeps cancelling pg_stop_backup() without ever resetting the exclusive flag in xlogfunc.c can decrement the the shared variable XLogCtl->Insert.nonExclusiveBackups as many times as it wants. As far as I can see the worst that will happen is that XLogCtl->Insert.forcePageWrites won't get set back to false, but that's still a bug. This condition should throw "backup is not in progress" just as a exclusive backup would, whether assertions are enabled or not. I believe the solution is to move the exclusive flag to xlog.c and only decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true, otherwise return an error. Even then, it wouldn't be clear if the backup had completed or not. I suppose any cancelled non-exclusive pg_stop_backup() should be considered aborted whether a stop backup record was written or not? If that makes sense I'm happy to work up a patch. This is definitely an edge case and I seriously doubt it is causing any issues in the field. -- -David david@pgmasters.net -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote: > This condition should throw "backup is not in progress" just as a > exclusive backup would, whether assertions are enabled or not. > > I believe the solution is to move the exclusive flag to xlog.c and only > decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true, > otherwise return an error. Even then, it wouldn't be clear if the > backup had completed or not. I understand by this sentence that you mean nonexclusive_backup_running, and I think that I agree with that. We need a way to reset it properly the session-level switch in case of an interrupt found, and that needs visibly to happen in pg_stop_backup_callback and pg_start_backup_callback(). At the same time I think that exclusive_backup_running had better be moved to xlog.c as well. If I look at the failure in details when issuing a cancel, I can see that XLogCtl->Insert.nonExclusiveBackups gets decremented at the end do_pg_stop_backup, but nonexclusive_backup_running never gets set back to false because of the query cancellation. > I suppose any cancelled non-exclusive > pg_stop_backup() should be considered aborted whether a stop backup > record was written or not? That's not necessarily true, I can see a stop backup able to finish as well by issuing a cancel request. It seems to me that we just need to have the shmem information updated at the same time as the session-level switches for consistency and we're good. The inconsistency in places when updating the session-level flags and the shmem-level flags is what is causing harm. > If that makes sense I'm happy to work up a patch. This is definitely an > edge case and I seriously doubt it is causing any issues in the field. Well, at least it is nothing caused directly by 974ece5, I am able to crash the server with or without that... There is actually some code that I know of that can issue a cancel request on pg_stop_backend(), this does not have assertions of course but it may become an issue. That's pretty close to the improvements I did recently for lock handling of exclusive backups, so there's nothing really new for me :) David, are you willing to write a patch or should I? Changing nonexclusive_backup_running/exclusive_backup_running to be an enum would make the code more readable as well IMO. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On 2/27/17 10:05 PM, Michael Paquier wrote: > On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote: >> >> I believe the solution is to move the exclusive flag to xlog.c and only >> decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true, >> otherwise return an error. Even then, it wouldn't be clear if the >> backup had completed or not. > > I understand by this sentence that you mean > nonexclusive_backup_running, and I think that I agree with that. Whoops! Yes, I meant the nonexclusive_backup_running flag. > We > need a way to reset it properly the session-level switch in case of an > interrupt found, and that needs visibly to happen in > pg_stop_backup_callback and pg_start_backup_callback(). At the same > time I think that exclusive_backup_running had better be moved to > xlog.c as well. If I look at the failure in details when issuing a > cancel, Agreed. > I can see that XLogCtl->Insert.nonExclusiveBackups gets > decremented at the end do_pg_stop_backup, but > nonexclusive_backup_running never gets set back to false because of > the query cancellation. Exactly. >> I suppose any cancelled non-exclusive >> pg_stop_backup() should be considered aborted whether a stop backup >> record was written or not? > > That's not necessarily true, I can see a stop backup able to finish as > well by issuing a cancel request. It seems to me that we just need to > have the shmem information updated at the same time as the > session-level switches for consistency and we're good. The > inconsistency in places when updating the session-level flags and the > shmem-level flags is what is causing harm. I'm sure this could be done but it will require quite a bit of refactoring and I'm not sure that it's worth it. In my mind it would be enough to document that cancelled backups should be considered invalid whether the stop backup record was written or not. However, I'm willing to go with the majority opinion. >> If that makes sense I'm happy to work up a patch. This is definitely an >> edge case and I seriously doubt it is causing any issues in the field. > > <...> > > David, are you willing to write a patch or should I? Changing > nonexclusive_backup_running/exclusive_backup_running to be an enum > would make the code more readable as well IMO. I would like to see if anyone else weighs in on this first, but yes I am planning to write a patch. Agreed on the enum. -- -David david@pgmasters.net -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote: > On 2/27/17 10:05 PM, Michael Paquier wrote: >> That's not necessarily true, I can see a stop backup able to finish as >> well by issuing a cancel request. It seems to me that we just need to >> have the shmem information updated at the same time as the >> session-level switches for consistency and we're good. The >> inconsistency in places when updating the session-level flags and the >> shmem-level flags is what is causing harm. > > I'm sure this could be done but it will require quite a bit of > refactoring and I'm not sure that it's worth it. In my mind it would be > enough to document that cancelled backups should be considered invalid > whether the stop backup record was written or not. However, I'm willing > to go with the majority opinion. I would think that addressing the problem is the way to go, hitting an assertion in this code path means that we are doing something wrong so simply removing it does not sound like a correct answer to me. >>> If that makes sense I'm happy to work up a patch. This is definitely an >>> edge case and I seriously doubt it is causing any issues in the field. >> >> <...> >> >> David, are you willing to write a patch or should I? Changing >> nonexclusive_backup_running/exclusive_backup_running to be an enum >> would make the code more readable as well IMO. > > I would like to see if anyone else weighs in on this first, but yes I am > planning to write a patch. Agreed on the enum. This was itching me yesterday so I wrote a patch able to address this problem... At least it can be registered to the CF on time and give it more visibility. By the way, I have noticed a second bug in the current logic while looking at the problem you have reported: 1) Start backup in session 1: =# select pg_start_backup('popo'); pg_start_backup ----------------- 0/2000028 (1 row) 2) pg_stop_backup() in session 2 3) And now when trying to work on backups with session 1: =# select pg_start_backup('popo'); ERROR: 55000: a backup is already in progress in this session LOCATION: pg_start_backup, xlogfuncs.c:87 =# select pg_stop_backup(); ERROR: 55000: exclusive backup not in progress LOCATION: do_pg_stop_backup, xlog.c:10642 So the handling around exclusive_backup_running is broken now as it is possible to lock a session easily when handling backups. And the root of the problem is that checks on exclusive_backup_running are not necessary. I have fixed as well this problem as per the attached. Note however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch attached when doing checks with pg_stop_backup_v2() for a non-exclusive backup run. It is also useful for debugging. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
On 2/28/17 9:08 PM, Michael Paquier wrote: > On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote: >> On 2/27/17 10:05 PM, Michael Paquier wrote: >>> That's not necessarily true, I can see a stop backup able to finish as >>> well by issuing a cancel request. It seems to me that we just need to >>> have the shmem information updated at the same time as the >>> session-level switches for consistency and we're good. The >>> inconsistency in places when updating the session-level flags and the >>> shmem-level flags is what is causing harm. >> >> I'm sure this could be done but it will require quite a bit of >> refactoring and I'm not sure that it's worth it. In my mind it would be >> enough to document that cancelled backups should be considered invalid >> whether the stop backup record was written or not. However, I'm willing >> to go with the majority opinion. > > I would think that addressing the problem is the way to go, hitting an > assertion in this code path means that we are doing something wrong so > simply removing it does not sound like a correct answer to me. > >>>> If that makes sense I'm happy to work up a patch. This is definitely an >>>> edge case and I seriously doubt it is causing any issues in the field. >>> >>> <...> >>> >>> David, are you willing to write a patch or should I? Changing >>> nonexclusive_backup_running/exclusive_backup_running to be an enum >>> would make the code more readable as well IMO. >> >> I would like to see if anyone else weighs in on this first, but yes I am >> planning to write a patch. Agreed on the enum. > > This was itching me yesterday so I wrote a patch able to address this > problem... I know that feeling. > At least it can be registered to the CF on time and give it > more visibility. Indeed. > By the way, I have noticed a second bug in the > current logic while looking at the problem you have reported: > 1) Start backup in session 1: > =# select pg_start_backup('popo'); > pg_start_backup > ----------------- > 0/2000028 > (1 row) > 2) pg_stop_backup() in session 2 > 3) And now when trying to work on backups with session 1: > =# select pg_start_backup('popo'); > ERROR: 55000: a backup is already in progress in this session > LOCATION: pg_start_backup, xlogfuncs.c:87 > =# select pg_stop_backup(); > ERROR: 55000: exclusive backup not in progress > LOCATION: do_pg_stop_backup, xlog.c:10642 Yeah, that doesn't look good. > So the handling around exclusive_backup_running is broken now as it is > possible to lock a session easily when handling backups. And the root > of the problem is that checks on exclusive_backup_running are not > necessary. I have fixed as well this problem as per the attached. Note > however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch > attached when doing checks with pg_stop_backup_v2() for a > non-exclusive backup run. It is also useful for debugging. Excellent! I'll have a look, and thanks for working up a patch. -- -David david@pgmasters.net -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs