Re: Backup command and functions can cause assertion failure and segmentation fault - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Backup command and functions can cause assertion failure and segmentation fault
Date
Msg-id CAD21AoBkR611M6K+6i-xNDjmCd-=hFN_svNgRtvuezyjviw1sA@mail.gmail.com
Whole thread Raw
In response to Backup command and functions can cause assertion failure and segmentation fault  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Backup command and functions can cause assertion failure and segmentation fault
List pgsql-hackers
Hi,

On Thu, Jun 30, 2022 at 12:29 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
>
> Hi,
>
> I found that the assertion failure and the segmentation fault could
> happen by running pg_backup_start(), pg_backup_stop() and BASE_BACKUP
> replication command, in v15 or before.
>
> Here is the procedure to reproduce the assertion failure.
>
> 1. Connect to the server as the REPLICATION user who is granted
>     EXECUTE to run pg_backup_start() and pg_backup_stop().
>
>      $ psql
>      =# CREATE ROLE foo REPLICATION LOGIN;
>      =# GRANT EXECUTE ON FUNCTION pg_backup_start TO foo;
>      =# GRANT EXECUTE ON FUNCTION pg_backup_stop TO foo;
>      =# \q
>
>      $ psql "replication=database user=foo dbname=postgres"
>
> 2. Run pg_backup_start() and pg_backup_stop().
>
>      => SELECT pg_backup_start('test', true);
>      => SELECT pg_backup_stop();
>
> 3. Run BASE_BACKUP replication command with smaller MAX_RATE so that
>     it can take a long time to finish.
>
>      => BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);
>
> 4. Terminate the replication connection while it's running BASE_BACKUP.
>
>      $ psql
>      =# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE backend_type = 'walsender';
>
> This procedure can cause the following assertion failure.
>
> TRAP: FailedAssertion("XLogCtl->Insert.runningBackups > 0", File: "xlog.c", Line: 8779, PID: 69434)
> 0   postgres                            0x000000010ab2ff7f ExceptionalCondition + 223
> 1   postgres                            0x000000010a455126 do_pg_abort_backup + 102
> 2   postgres                            0x000000010a8e13aa shmem_exit + 218
> 3   postgres                            0x000000010a8e11ed proc_exit_prepare + 125
> 4   postgres                            0x000000010a8e10f3 proc_exit + 19
> 5   postgres                            0x000000010ab3171c errfinish + 1100
> 6   postgres                            0x000000010a91fa80 ProcessInterrupts + 1376
> 7   postgres                            0x000000010a886907 throttle + 359
> 8   postgres                            0x000000010a88675d bbsink_throttle_archive_contents + 29
> 9   postgres                            0x000000010a885aca bbsink_archive_contents + 154
> 10  postgres                            0x000000010a885a2a bbsink_forward_archive_contents + 218
> 11  postgres                            0x000000010a884a99 bbsink_progress_archive_contents + 89
> 12  postgres                            0x000000010a881aba bbsink_archive_contents + 154
> 13  postgres                            0x000000010a881598 sendFile + 1816
> 14  postgres                            0x000000010a8806c5 sendDir + 3573
> 15  postgres                            0x000000010a8805d9 sendDir + 3337
> 16  postgres                            0x000000010a87e262 perform_base_backup + 1250
> 17  postgres                            0x000000010a87c734 SendBaseBackup + 500
> 18  postgres                            0x000000010a89a7f8 exec_replication_command + 1144
> 19  postgres                            0x000000010a92319a PostgresMain + 2154
> 20  postgres                            0x000000010a82b702 BackendRun + 50
> 21  postgres                            0x000000010a82acfc BackendStartup + 524
> 22  postgres                            0x000000010a829b2c ServerLoop + 716
> 23  postgres                            0x000000010a827416 PostmasterMain + 6470
> 24  postgres                            0x000000010a703e19 main + 809
> 25  libdyld.dylib                       0x00007fff2072ff3d start + 1
>
>
> Here is the procedure to reproduce the segmentation fault.
>
> 1. Connect to the server as the REPLICATION user who is granted
>     EXECUTE to run pg_backup_stop().
>
>      $ psql
>      =# CREATE ROLE foo REPLICATION LOGIN;
>      =# GRANT EXECUTE ON FUNCTION pg_backup_stop TO foo;
>      =# \q
>
>      $ psql "replication=database user=foo dbname=postgres"
>
> 2. Run BASE_BACKUP replication command with smaller MAX_RATE so that
>     it can take a long time to finish.
>
>      => BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);
>
> 3. Press Ctrl-C to cancel BASE_BACKUP while it's running.
>
> 4. Run pg_backup_stop().
>
>      => SELECT pg_backup_stop();
>
> This procedure can cause the following segmentation fault.
>
>      LOG:  server process (PID 69449) was terminated by signal 11: Segmentation fault: 11
>      DETAIL:  Failed process was running: SELECT pg_backup_stop();
>
>
> The root cause of these failures seems that sessionBackupState flag
> is not reset to SESSION_BACKUP_NONE even when BASE_BACKUP is aborted.
> So attached patch changes do_pg_abort_backup callback so that
> it resets sessionBackupState. I confirmed that, with the patch,
> those assertion failure and segmentation fault didn't happen.

The change looks good to me. I've also confirmed the change fixed the issues.

> But this change has one issue that; if BASE_BACKUP is run while
> a backup is already in progress in the session by pg_backup_start()
> and that session is terminated, the change causes XLogCtl->Insert.runningBackups
> to be decremented incorrectly. That is, XLogCtl->Insert.runningBackups
> is incremented by two by pg_backup_start() and BASE_BACKUP,
> but it's decremented only by one by the termination of the session.
>
> To address this issue, I think that we should disallow BASE_BACKUP
> to run while a backup is already in progress in the *same* session
> as we already do this for pg_backup_start(). Thought? I included
> the code to disallow that in the attached patch.

+1

@@ -233,6 +233,12 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
    StringInfo  labelfile;
    StringInfo  tblspc_map_file;
    backup_manifest_info manifest;
+   SessionBackupState status = get_backup_status();
+
+   if (status == SESSION_BACKUP_RUNNING)
+       ereport(ERROR,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("a backup is already in progress in this session")));

I think we can move it to the beginning of SendBaseBackup() so we can
avoid bbsink initialization and cleanup in the error case.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Issue with pg_stat_subscription_stats
Next
From: Dilip Kumar
Date:
Subject: Re: making relfilenodes 56 bits