Re: [BUGS] Backend crash on non-exclusive backup cancel - Mailing list pgsql-bugs

From David Steele
Subject Re: [BUGS] Backend crash on non-exclusive backup cancel
Date
Msg-id 6a8f9d7e-ce4e-d850-12d3-7787815447ea@pgmasters.net
Whole thread Raw
In response to Re: [BUGS] Backend crash on non-exclusive backup cancel  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-bugs
On 2/28/17 9:08 PM, Michael Paquier wrote:
> On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote:
>> On 2/27/17 10:05 PM, Michael Paquier wrote:
>>> That's not necessarily true, I can see a stop backup able to finish as
>>> well by issuing a cancel request. It seems to me that we just need to
>>> have the shmem information updated at the same time as the
>>> session-level switches for consistency and we're good. The
>>> inconsistency in places when updating the session-level flags and the
>>> shmem-level flags is what is causing harm.
>>
>> I'm sure this could be done but it will require quite a bit of
>> refactoring and I'm not sure that it's worth it.  In my mind it would be
>> enough to document that cancelled backups should be considered invalid
>> whether the stop backup record was written or not.  However, I'm willing
>> to go with the majority opinion.
> 
> I would think that addressing the problem is the way to go, hitting an
> assertion in this code path means that we are doing something wrong so
> simply removing it does not sound like a correct answer to me.
> 
>>>> If that makes sense I'm happy to work up a patch.  This is definitely an
>>>> edge case and I seriously doubt it is causing any issues in the field.
>>>
>>> <...>
>>>
>>> David, are you willing to write a patch or should I? Changing
>>> nonexclusive_backup_running/exclusive_backup_running to be an enum
>>> would make the code more readable as well IMO.
>>
>> I would like to see if anyone else weighs in on this first, but yes I am
>> planning to write a patch.  Agreed on the enum.
> 
> This was itching me yesterday so I wrote a patch able to address this
> problem... 

I know that feeling.

> At least it can be registered to the CF on time and give it
> more visibility. 

Indeed.

> By the way, I have noticed a second bug in the
> current logic while looking at the problem you have reported:
> 1) Start backup in session 1:
> =# select pg_start_backup('popo');
>  pg_start_backup
> -----------------
>  0/2000028
> (1 row)
> 2) pg_stop_backup() in session 2
> 3) And now when trying to work on backups with session 1:
> =# select pg_start_backup('popo');
> ERROR:  55000: a backup is already in progress in this session
> LOCATION:  pg_start_backup, xlogfuncs.c:87
> =# select pg_stop_backup();
> ERROR:  55000: exclusive backup not in progress
> LOCATION:  do_pg_stop_backup, xlog.c:10642

Yeah, that doesn't look good.

> So the handling around exclusive_backup_running is broken now as it is
> possible to lock a session easily when handling backups. And the root
> of the problem is that checks on exclusive_backup_running are not
> necessary. I have fixed as well this problem as per the attached. Note
> however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
> attached when doing checks with pg_stop_backup_v2() for a
> non-exclusive backup run. It is also useful for debugging.

Excellent! I'll have a look, and thanks for working up a patch.

-- 
-David
david@pgmasters.net


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [BUGS] Backend crash on non-exclusive backup cancel
Next
From: rick@scienceinvision.com
Date:
Subject: [BUGS] BUG #14572: pgadmin restore command path error