Re: Backup command and functions can cause assertion failure and segmentation fault - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Backup command and functions can cause assertion failure and segmentation fault |
Date | |
Msg-id | CAD21AoBkR611M6K+6i-xNDjmCd-=hFN_svNgRtvuezyjviw1sA@mail.gmail.com Whole thread Raw |
In response to | Backup command and functions can cause assertion failure and segmentation fault (Fujii Masao <masao.fujii@oss.nttdata.com>) |
Responses |
Re: Backup command and functions can cause assertion failure and segmentation fault
|
List | pgsql-hackers |
Hi, On Thu, Jun 30, 2022 at 12:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > Hi, > > I found that the assertion failure and the segmentation fault could > happen by running pg_backup_start(), pg_backup_stop() and BASE_BACKUP > replication command, in v15 or before. > > Here is the procedure to reproduce the assertion failure. > > 1. Connect to the server as the REPLICATION user who is granted > EXECUTE to run pg_backup_start() and pg_backup_stop(). > > $ psql > =# CREATE ROLE foo REPLICATION LOGIN; > =# GRANT EXECUTE ON FUNCTION pg_backup_start TO foo; > =# GRANT EXECUTE ON FUNCTION pg_backup_stop TO foo; > =# \q > > $ psql "replication=database user=foo dbname=postgres" > > 2. Run pg_backup_start() and pg_backup_stop(). > > => SELECT pg_backup_start('test', true); > => SELECT pg_backup_stop(); > > 3. Run BASE_BACKUP replication command with smaller MAX_RATE so that > it can take a long time to finish. > > => BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32); > > 4. Terminate the replication connection while it's running BASE_BACKUP. > > $ psql > =# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE backend_type = 'walsender'; > > This procedure can cause the following assertion failure. > > TRAP: FailedAssertion("XLogCtl->Insert.runningBackups > 0", File: "xlog.c", Line: 8779, PID: 69434) > 0 postgres 0x000000010ab2ff7f ExceptionalCondition + 223 > 1 postgres 0x000000010a455126 do_pg_abort_backup + 102 > 2 postgres 0x000000010a8e13aa shmem_exit + 218 > 3 postgres 0x000000010a8e11ed proc_exit_prepare + 125 > 4 postgres 0x000000010a8e10f3 proc_exit + 19 > 5 postgres 0x000000010ab3171c errfinish + 1100 > 6 postgres 0x000000010a91fa80 ProcessInterrupts + 1376 > 7 postgres 0x000000010a886907 throttle + 359 > 8 postgres 0x000000010a88675d bbsink_throttle_archive_contents + 29 > 9 postgres 0x000000010a885aca bbsink_archive_contents + 154 > 10 postgres 0x000000010a885a2a bbsink_forward_archive_contents + 218 > 11 postgres 0x000000010a884a99 bbsink_progress_archive_contents + 89 > 12 postgres 0x000000010a881aba bbsink_archive_contents + 154 > 13 postgres 0x000000010a881598 sendFile + 1816 > 14 postgres 0x000000010a8806c5 sendDir + 3573 > 15 postgres 0x000000010a8805d9 sendDir + 3337 > 16 postgres 0x000000010a87e262 perform_base_backup + 1250 > 17 postgres 0x000000010a87c734 SendBaseBackup + 500 > 18 postgres 0x000000010a89a7f8 exec_replication_command + 1144 > 19 postgres 0x000000010a92319a PostgresMain + 2154 > 20 postgres 0x000000010a82b702 BackendRun + 50 > 21 postgres 0x000000010a82acfc BackendStartup + 524 > 22 postgres 0x000000010a829b2c ServerLoop + 716 > 23 postgres 0x000000010a827416 PostmasterMain + 6470 > 24 postgres 0x000000010a703e19 main + 809 > 25 libdyld.dylib 0x00007fff2072ff3d start + 1 > > > Here is the procedure to reproduce the segmentation fault. > > 1. Connect to the server as the REPLICATION user who is granted > EXECUTE to run pg_backup_stop(). > > $ psql > =# CREATE ROLE foo REPLICATION LOGIN; > =# GRANT EXECUTE ON FUNCTION pg_backup_stop TO foo; > =# \q > > $ psql "replication=database user=foo dbname=postgres" > > 2. Run BASE_BACKUP replication command with smaller MAX_RATE so that > it can take a long time to finish. > > => BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32); > > 3. Press Ctrl-C to cancel BASE_BACKUP while it's running. > > 4. Run pg_backup_stop(). > > => SELECT pg_backup_stop(); > > This procedure can cause the following segmentation fault. > > LOG: server process (PID 69449) was terminated by signal 11: Segmentation fault: 11 > DETAIL: Failed process was running: SELECT pg_backup_stop(); > > > The root cause of these failures seems that sessionBackupState flag > is not reset to SESSION_BACKUP_NONE even when BASE_BACKUP is aborted. > So attached patch changes do_pg_abort_backup callback so that > it resets sessionBackupState. I confirmed that, with the patch, > those assertion failure and segmentation fault didn't happen. The change looks good to me. I've also confirmed the change fixed the issues. > But this change has one issue that; if BASE_BACKUP is run while > a backup is already in progress in the session by pg_backup_start() > and that session is terminated, the change causes XLogCtl->Insert.runningBackups > to be decremented incorrectly. That is, XLogCtl->Insert.runningBackups > is incremented by two by pg_backup_start() and BASE_BACKUP, > but it's decremented only by one by the termination of the session. > > To address this issue, I think that we should disallow BASE_BACKUP > to run while a backup is already in progress in the *same* session > as we already do this for pg_backup_start(). Thought? I included > the code to disallow that in the attached patch. +1 @@ -233,6 +233,12 @@ perform_base_backup(basebackup_options *opt, bbsink *sink) StringInfo labelfile; StringInfo tblspc_map_file; backup_manifest_info manifest; + SessionBackupState status = get_backup_status(); + + if (status == SESSION_BACKUP_RUNNING) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("a backup is already in progress in this session"))); I think we can move it to the beginning of SendBaseBackup() so we can avoid bbsink initialization and cleanup in the error case. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: