Thread: ReplicationSlotRelease may set the statusFlags of other processes in PG14

Hello hackers,

I encountered an issue where an assertion failure occurred for
ProcGlobal->statusFlags. The detailed analysis is as follows:

# Issue background

PostgreSQL version: REL_14_STABLE v14.11 3621ffd9f2

The assert failure occurs when the autovacuum worker process commits a
transaction: proc->statusFlags is 3, and ProcGlobal->statusFlags[proc->pgxactoff]
is 0. The specific stack trace is as follows:
```
#0  0x00007f174c401277 in raise () from /lib64/libc.so.6
#1  0x00007f174c402968 in abort () from /lib64/libc.so.6
#2  0x00000000010f1403 in ExceptionalCondition (conditionName=0x1728760 "proc->statusFlags == ProcGlobal->statusFlags[proc->pgxactoff]", errorType=0x1728420 "FailedAssertion", fileName=0x17284b8 "procarray.c", lineNumber=776) at assert.c:44
#3  0x0000000000e997d0 in ProcArrayEndTransaction (proc=0x7f1731630f00, latestXid=0) at procarray.c:776
#4  0x00000000008cfb18 in CommitTransaction () at xact.c:2447
#5  0x00000000008d090e in CommitTransactionCommand () at xact.c:3293
#6  0x0000000000bb5652 in vacuum_rel (relid=2608, relation=0x56f76f8, params=0x56e399c) at vacuum.c:2174
#7  0x0000000000bb300e in vacuum (relations=0x5535b40, params=0x56e399c, bstrategy=0x56e3a30, isTopLevel=true) at vacuum.c:486
#8  0x0000000000dc01a9 in autovacuum_do_vac_analyze (tab=0x56e3998, bstrategy=0x56e3a30) at autovacuum.c:3467
#9  0x0000000000dbeb94 in do_autovacuum () at autovacuum.c:2632
#10 0x0000000000dbd530 in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1769
#11 0x0000000000dbd12a in StartAutoVacWorker () at autovacuum.c:1545
#12 0x0000000000dd894f in StartAutovacuumWorker () at postmaster.c:6441
#13 0x0000000000dd8189 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:6108
#14 <signal handler called>
#15 0x00007f174c4bffc3 in __select_nocancel () from /lib64/libc.so.6
#16 0x0000000000dd1d67 in ServerLoop () at postmaster.c:1881
#17 0x0000000000dd16d9 in PostmasterMain (argc=3, argv=0x533e060) at postmaster.c:1582
#18 0x0000000000ca4af5 in main (argc=3, argv=0x533e060) at main.c:216
```

# Analysis and Reproduction
## Analysis

A process utilizing replication slots (usually walsender) calls callback
functions in the order of RemoveProcFromArray->ProcKill upon abnormal exit.
Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
ProcKill then attempts to set ProcGlobal->statusFlags[MyProc->pgxactoff] again
via ReplicationSlotRelease. By this time, the flag may already be assigned to
another process.

## Reproduction

This issue only occurs in PG14. In versions prior to PG14, ProcGlobal->statusFlags
had not been refactored. In versions following PG14, it has been fixed by commit
2f6501f (discussed in [1]). Therefore, I've cc'd Masahiko, Kyotaro, and Andres in
hopes that they can provide some insights.

To replicate the issue, execute the following steps:
1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
   where pgxactoff is set to an invalid value in ProcArrayRemove, and some
   checks are added.
2. Use the SQL below to terminate the walsender process.
```
select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
```

# Fix

To fix the issue, I have provided some patches in the attachment:
1. Backpatching 2f6501f into the PG14 version will fix the problem.
2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
   checks should be done when setting ProcGlobal->statusFlags.


[1] https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.
 
Attachment
On Sat, Mar 16, 2024 at 10:29:03PM +0800, feichanghong wrote:
> A process utilizing replication slots (usually walsender) calls callback
> functions in the order of RemoveProcFromArray->ProcKill upon abnormal exit.
> Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
> ProcKill then attempts to set ProcGlobal->statusFlags[MyProc->pgxactoff] again
> via ReplicationSlotRelease. By this time, the flag may already be assigned to
> another process.

Oops.

> To replicate the issue, execute the following steps:
> 1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
> where pgxactoff is set to an invalid value in ProcArrayRemove, and some
> checks are added.
> 2. Use the SQL below to terminate the walsender process.
> ```
> select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
> ```
> # Fix
>
> To fix the issue, I have provided some patches in the attachment:
> 1. Backpatching 2f6501f into the PG14 version will fix the problem.
> 2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
> checks should be done when setting ProcGlobal->statusFlags.

Yeah, that's something that we had better fix in all stable branches.
The asserts would offer some protection moving on, but I would take
the safer move of only adding a protection like what you are
suggestion on HEAD and not in stable branches, just in case we're
missing something around them.
--
Michael

Attachment
Dear Michael,

Thank you for your attention.

On Mar 19, 2024, at 11:57, Michael Paquier <michael@paquier.xyz> wrote:

Yeah, that's something that we had better fix in all stable branches.

Yes, as I mentioned in my last reply, it is only necessary to fix the issue in V14.

The asserts would offer some protection moving on, but I would take
the safer move of only adding a protection like what you are
suggestion on HEAD and not in stable branches, just in case we're
missing something around them.

I agree it, we may only need to add assert checks on HEAD.


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.