Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted. - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted. |
Date | |
Msg-id | CAB7nPqRVOuRWYqcf_BvHrPWnGDfwy04ZaghKVHUP0y6oeG_u-g@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted. (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
|
List | pgsql-hackers |
On Thu, Nov 16, 2017 at 8:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Nov 16, 2017 at 1:11 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> + /* >> + * Quick exit if session is not keeping around a non-exclusive backup >> + * already started. >> + */ >> + if (sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE) >> + return; >> I think that it would be more solid to use SESSION_BACKUP_NONE for the >> comparison, and complete the assertion after the quick exit as follows >> as this code path should never be taken for an exclusive backup: > > Agreed. I have spent some time doing an extra lookup with tests involving one and two sessions doing backups checking for failure code paths while waiting for archives: - One session with non-exclusive backup. - One session with exclusive backup. - One session is exclusive and the second is non-exclusive - Both are exclusive. Also double-checked on the way the logic around the cleanup callback and sessionBackupState during startup, and those look clean to me. One thing that was bothering me is that the callback nonexclusive_base_backup_cleanup is called after do_pg_start_backup() finishes. But between the moment sessionBackupState is set and the callback is registered there is no CHECK_FOR_INTERRUPTS or even elog() calls so even if the process starting a non-exclusive backup is tried to be terminated between the moment the session lock is set and the callback is registered things are handled correctly. So I think that we are basically safe for backups running with the SQL interface. However, things are not completely clean for base backups taken using the replication protocol. While monitoring more the code, I have noticed that perform_base_backup() calls do_pg_stop_backup() *without* taking any cleanup action. So if a base backup is interrupted, say with SIGTERM, while do_pg_stop_backup() is called and before the session lock is updated then it is possible to finish with inconsistent in-memory counters. Oops. No need to play with breakpoints and signals in this case, using something like that is enough to create inconsistent counters. --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -327,6 +327,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir) } PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup,(Datum) 0); + elog(ERROR, "base backups don't decrement counters here, stupid!"); + endptr = do_pg_stop_backup(labelfile->data, !opt->nowait, &endtli); A simple fix for this one is to call do_pg_stop_backup() before PG_END_ENSURE_ERROR_CLEANUP(). By doing so, we also make the callback cleanup logic consistent with what is done for the SQL equivalent where the callback is removed after finishing going through do_pg_stop_backup(). A comment would be adapted here, say something like "finish the backup while still holding the cleanup callback to avoid inconsistent in-memory data should the this call fail before sessionBackupState is updated." For the start phase, the current logic is fine, because in the case of the SQL interface the cleanup callback is registered after finishing do_pg_start_backup(). What do you think? -- Michael
pgsql-hackers by date: