Re: [BUGS] Backend crash on non-exclusive backup cancel - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: [BUGS] Backend crash on non-exclusive backup cancel
Date
Msg-id CAB7nPqR9xvq_wYKFjUYF1CECFLRbe6xGhGWaCv_yE8pZU71Ztw@mail.gmail.com
Whole thread Raw
In response to Re: [BUGS] Backend crash on non-exclusive backup cancel  (David Steele <david@pgmasters.net>)
Responses Re: [BUGS] Backend crash on non-exclusive backup cancel
List pgsql-bugs
On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote:
> On 2/27/17 10:05 PM, Michael Paquier wrote:
>> That's not necessarily true, I can see a stop backup able to finish as
>> well by issuing a cancel request. It seems to me that we just need to
>> have the shmem information updated at the same time as the
>> session-level switches for consistency and we're good. The
>> inconsistency in places when updating the session-level flags and the
>> shmem-level flags is what is causing harm.
>
> I'm sure this could be done but it will require quite a bit of
> refactoring and I'm not sure that it's worth it.  In my mind it would be
> enough to document that cancelled backups should be considered invalid
> whether the stop backup record was written or not.  However, I'm willing
> to go with the majority opinion.

I would think that addressing the problem is the way to go, hitting an
assertion in this code path means that we are doing something wrong so
simply removing it does not sound like a correct answer to me.

>>> If that makes sense I'm happy to work up a patch.  This is definitely an
>>> edge case and I seriously doubt it is causing any issues in the field.
>>
>> <...>
>>
>> David, are you willing to write a patch or should I? Changing
>> nonexclusive_backup_running/exclusive_backup_running to be an enum
>> would make the code more readable as well IMO.
>
> I would like to see if anyone else weighs in on this first, but yes I am
> planning to write a patch.  Agreed on the enum.

This was itching me yesterday so I wrote a patch able to address this
problem... At least it can be registered to the CF on time and give it
more visibility. By the way, I have noticed a second bug in the
current logic while looking at the problem you have reported:
1) Start backup in session 1:
=# select pg_start_backup('popo');
 pg_start_backup
-----------------
 0/2000028
(1 row)
2) pg_stop_backup() in session 2
3) And now when trying to work on backups with session 1:
=# select pg_start_backup('popo');
ERROR:  55000: a backup is already in progress in this session
LOCATION:  pg_start_backup, xlogfuncs.c:87
=# select pg_stop_backup();
ERROR:  55000: exclusive backup not in progress
LOCATION:  do_pg_stop_backup, xlog.c:10642

So the handling around exclusive_backup_running is broken now as it is
possible to lock a session easily when handling backups. And the root
of the problem is that checks on exclusive_backup_running are not
necessary. I have fixed as well this problem as per the attached. Note
however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
attached when doing checks with pg_stop_backup_v2() for a
non-exclusive backup run. It is also useful for debugging.
-- 
Michael

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment

pgsql-bugs by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [BUGS] BUG #14543: libpq fails with group readable ssl keys
Next
From: David Steele
Date:
Subject: Re: [BUGS] Backend crash on non-exclusive backup cancel