Re: Backup command and functions can cause assertion failure and segmentation fault - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Backup command and functions can cause assertion failure and segmentation fault
Date
Msg-id bef8abcd-8cb8-1320-9f2b-bc9677413625@oss.nttdata.com
Whole thread Raw
In response to Backup command and functions can cause assertion failure and segmentation fault  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Backup command and functions can cause assertion failure and segmentation fault
List pgsql-hackers

On 2022/07/07 9:09, Michael Paquier wrote:
> On Wed, Jul 06, 2022 at 11:27:58PM +0900, Fujii Masao wrote:
>> For the test, BASE_BACKUP needs to be canceled after it finishes
>> do_pg_backup_start(), i.e., checkpointing, and before it calls
>> do_pg_backup_stop(). So the timing to cancel that seems more severe
>> than the test added in 0475a97f. I'm afraid that some tests can
>> easily cancel the BASE_BACKUP while it's performing a checkpoint in
>> do_pg_backup_start(). So for now I'm thinking to avoid such an
>> unstable test.
> 
> Hmm.  In order to make sure that the checkpoint of the base backup is
> completed, and assuming that the checkpoint is fast while the base
> backup has a max rate, you could rely on a query that does a
> poll_query_until() on pg_control_checkpoint(), no?  As long as you use
> IPC::Run::start, pg_basebackup would be async so the polling query and
> the cancellation can be done in parallel of it.  0475a97 did almost
> that, except that it waits for the WAL sender to be started.

There seems to be some corner cases where we cannot rely on that.

If "spread" checkpoint is already running when BASE_BACKUP is executed, poll_query_until() may report the end of that
"spread"checkpoint before BASE_BACKUP internally starts its checkpoint. Which may cause the test to fail.
 

If BASE_BACKUP is accidentally canceled after poll_query_until() reports the end of checkpoint but before
do_pg_backup_start()finishes (i.e., before entering the error cleanup block using do_pg_abort_backup callback), the
testmay fail.
 

Probably we may be able to decrease the risk of those test failures by using some techniques, e.g., adding the fixed
waittime before requesting the cancel. But I'm not sure if it's worth adding the test for the corner case issue that I
reportedat the risk of adding the unstable test. The issue could happen only when both BASE_BACKUP and low level API
forbackup are eecuted via logical replication walsender mode, and BASE_BACKUP is canceled or terminated.
 

But if many think that it's worth adding the test, I will give a try. But even in that case, I think it's better to
committhe proposed patch at first to fix the bug, and then to write the patch adding the test.
 

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: remove more archiving overhead
Next
From: Alvaro Herrera
Date:
Subject: Re: archive modules