Re: [BUGS] Backend crash on non-exclusive backup cancel - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: [BUGS] Backend crash on non-exclusive backup cancel
Date
Msg-id CAB7nPqRCyoJTbgaLs9sH5Ehf0thHZp-KY8hJY6mCkaXAMW0i6w@mail.gmail.com
Whole thread Raw
In response to [BUGS] Backend crash on non-exclusive backup cancel  (David Steele <david@pgmasters.net>)
Responses Re: [BUGS] Backend crash on non-exclusive backup cancel
List pgsql-bugs
On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote:
> This condition should throw "backup is not in progress" just as a
> exclusive backup would, whether assertions are enabled or not.
>
> I believe the solution is to move the exclusive flag to xlog.c and only
> decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
> otherwise return an error.  Even then, it wouldn't be clear if the
> backup had completed or not.

I understand by this sentence that you mean
nonexclusive_backup_running, and I think that I agree with that. We
need a way to reset it properly the session-level switch in case of an
interrupt found, and that needs visibly to happen in
pg_stop_backup_callback and pg_start_backup_callback(). At the same
time I think that exclusive_backup_running had better be moved to
xlog.c as well. If I look at the failure in details when issuing a
cancel, I can see that XLogCtl->Insert.nonExclusiveBackups gets
decremented at the end do_pg_stop_backup, but
nonexclusive_backup_running never gets set back to false because of
the query cancellation.

> I suppose any cancelled non-exclusive
> pg_stop_backup() should be considered aborted whether a stop backup
> record was written or not?

That's not necessarily true, I can see a stop backup able to finish as
well by issuing a cancel request. It seems to me that we just need to
have the shmem information updated at the same time as the
session-level switches for consistency and we're good. The
inconsistency in places when updating the session-level flags and the
shmem-level flags is what is causing harm.

> If that makes sense I'm happy to work up a patch.  This is definitely an
> edge case and I seriously doubt it is causing any issues in the field.

Well, at least it is nothing caused directly by 974ece5, I am able to
crash the server with or without that...

There is actually some code that I know of that can issue a cancel
request on pg_stop_backend(), this does not have assertions of course
but it may become an issue. That's pretty close to the improvements I
did recently for lock handling of exclusive backups, so there's
nothing really new for me :)

David, are you willing to write a patch or should I? Changing
nonexclusive_backup_running/exclusive_backup_running to be an enum
would make the code more readable as well IMO.
-- 
Michael


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: David Steele
Date:
Subject: [BUGS] Backend crash on non-exclusive backup cancel
Next
From: Magnus Hagander
Date:
Subject: Re: [BUGS] BUG #14543: libpq fails with group readable ssl keys