Thread: ReplicationSlotRelease() crashes when the instance is in the single user mode

ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

I found $SUBJECT when I'm playing with the single user mode.

How to reproduce
===========
You can reproduce the failure with below steps.

```
# Initialize an instance
$ initdb -D data -U postgres
# Start it as single user mode
$ postgres --single -D data/ postgres

PostgreSQL stand-alone backend 18devel
backend> SELECT pg_create_physical_replication_slot(slot_name := 'physical_slot', immediately_reserve := true);
...
backend> SELECT pg_replication_slot_advance('physical_slot', pg_current_wal_lsn());
         1: pg_replication_slot_advance (typeid = 2249, len = -1, typmod = -1, byval = f)
        ----
TRAP: failed Assert("slot != NULL && (slot->active_pid != 0)"), File: "../postgres/src/backend/replication/slot.c",
Line:674, PID: 430860 
postgres(ExceptionalCondition+0xab)[0xb86a2a]
postgres(ReplicationSlotRelease+0x5a)[0x8df10b]
postgres(pg_replication_slot_advance+0x330)[0x8e46ed]
...
```

Analysis
=====
We trapped at below assertion in ReplicationSlotRelease(). IIUC, `slot->active_pid` is set
only when the process is under the postmaster, but ReplicationSlotRelease() always requires it.

```
    Assert(slot != NULL && slot->active_pid != 0);
```

Possible fix
=======

Naively considered, there are two approaches to fix this. 1) set active_pid when even in the single
user mode [1], or 2) ease the condition to accept the situation [2]. I'm not familiar with the mode,
but [1] seems better if we want to unify codes.

Thought?

[1]:
```
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -599,7 +599,7 @@ retry:
                SpinLockRelease(&s->mutex);
        }
        else
-               active_pid = MyProcPid;
+               s->active_pid = active_pid = MyProcPid;
        LWLockRelease(ReplicationSlotControlLock);

        /*
```
[2]:
```
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -671,7 +671,8 @@ ReplicationSlotRelease(void)
        bool            is_logical = false; /* keep compiler quiet */
        TimestampTz now = 0;

-       Assert(slot != NULL && slot->active_pid != 0);
+       Assert(slot != NULL &&
+                  (slot->active_pid != 0 || !IsUnderPostmaster));

        if (am_walsender)
        {
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"David G. Johnston"
Date:
On Monday, February 17, 2025, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:

backend> SELECT pg_create_physical_replication_slot(slot_name := 'physical_slot', immediately_reserve := true);

Since this function releases the slot when it returns, re-acquisition, even by the same backend, must always re-associate MyProcPid to the named slot.
 

[1]:
```
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -599,7 +599,7 @@ retry:
                SpinLockRelease(&s->mutex);
        }
        else
-               active_pid = MyProcPid;
+               s->active_pid = active_pid = MyProcPid;
        LWLockRelease(ReplicationSlotControlLock);

        /*
```

This, but you cannot modify the slot without holding the spinlock.

I’d probably add an assert that the existing state of s->active_pid is either 0 or MyProcPid already.  In single-user mode it mustn’t, really cannot, be anything else.  But the failure here is because the SQL function does a slot release; there are probably other reasonable paths where the assignment of MyProcPid during slot creation is retained and encountered during a subsequent slot acquire.

David J.

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Michael,

> Perhaps a very naive question, but is there any point in authorizing
> manipulations of MyReplicationSlot in single-user mode, to begin with?
> With this remark, I would mean to apply a rule to
> ReplicationSlotAcquire(), so as all its callers would know about that.

According to the original thread [1], there was a wide consensus replication-related
operations can be rejected, except the slot removal. I feel this is reasonable.

Currently pg_drop_replication_slot() requires the droping slot can be acquired,
so we cannot reject it in single user mode as-is. Maybe we should revive the 0002
patch in [1] then try to do that. Thought?

[1]: https://www.postgresql.org/message-id/3b2f809f-326c-38dd-7a9e-897f957a4eb1%40enterprisedb.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED




RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Michael,

> Ah, good point for the slot drop.  So 0ce5cf2ef24f is claiming that
> some of these manipulations are OK.  I didn't suspect this one.

Yeah, I think so.

> Slot advancing is a very different beast, unfortunately, that may
> depend on many other subsystems.  For example with logical slots we
> would finish by calling rm_decode, which could be outside of core.
> Justifying that this operation is supported in single-user mode is
> larger than what you are suggesting here..

OK. Actually, I do not have any use-cases in my mind. I played because
I found a special path for single user mode in ReplicationSlotAcquire()
so that I wanted to see the behavior.

Best regards,
Hayato Kuroda
FUJITSU LIMITED




On Wed, Feb 19, 2025 at 02:57:34AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Based on the discussion, I feel it is enough to add quick error out
> for SQL functions. PSA attached.

I did not check how these call behave individually, just a few
comments while putting my eyes on the patch.

+    if (!IsUnderPostmaster)
+        elog(ERROR,
+             "slot operation is prohibited in the single user mode");

elog() should not be used for failures that can be user-facing as this
would not provide any translation.

I'd suggest rewording the error message to provide some more context,
as well, say:
"cannot use %s in single-user mode", "function_name"
--
Michael

Attachment

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Michael,

> I did not check how these call behave individually, just a few
> comments while putting my eyes on the patch.
>
> +    if (!IsUnderPostmaster)
> +        elog(ERROR,
> +             "slot operation is prohibited in the single user mode");
>
> elog() should not be used for failures that can be user-facing as this
> would not provide any translation.

I intentionally used elog() because I thought single user mode is not user-facing.
But it is OK for me to use ereport() instead.

> I'd suggest rewording the error message to provide some more context,
> as well, say:
> "cannot use %s in single-user mode", "function_name"

Fixed. PSA new version

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"David G. Johnston"
Date:
On Wed, Feb 19, 2025 at 7:23 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
I intentionally used elog() because I thought single user mode is not user-facing.
But it is OK for me to use ereport() instead.

Single-user mode is also known as "Oh crap!" mode, something used when starting the server in multi-user model fails for non-trivial reasons.

It is also what at least one PostgreSQL Online Service (a.k.a. fiddle) uses as an implementation choice.

David J.

Hi,

On Thu, Feb 20, 2025 at 02:22:41AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear Michael,
> 

Thanks for the report and the patch!

> > I did not check how these call behave individually, just a few
> > comments while putting my eyes on the patch.
> > 
> > +    if (!IsUnderPostmaster)
> > +        elog(ERROR,
> > +             "slot operation is prohibited in the single user mode");
> > 
> > elog() should not be used for failures that can be user-facing as this
> > would not provide any translation.
> 
> I intentionally used elog() because I thought single user mode is not user-facing.
> But it is OK for me to use ereport() instead.

Yeah, I think it's also about using ereport for cases that we think can happen (
and so needs to be translated). In this case, it clearly can happen so ereport()
is better.

> > I'd suggest rewording the error message to provide some more context,
> > as well, say:
> > "cannot use %s in single-user mode", "function_name"
> 
> Fixed. PSA new version

=== 1

Nit:

"cannot use function %s in single-user mode", "function_name"" maybe? (that would
somehow match the existing ""user-defined types cannot use subscripting function %s")

=== 2

+ "pg_copy_replication_slot")));

s/pg_copy_replication_slot/copy_replication_slot/ maybe? As it is the function
that would report the error actually (could be called from
pg_copy_logical_replication_slot_[a|b|c] and pg_copy_physical_replication_slot_[a|b]
though).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, February 20, 2025 10:23 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Michael,
>
> > I did not check how these call behave individually, just a few
> > comments while putting my eyes on the patch.
> >
> > +    if (!IsUnderPostmaster)
> > +        elog(ERROR,
> > +             "slot operation is prohibited in the single user
> mode");
> >
> > elog() should not be used for failures that can be user-facing as this
> > would not provide any translation.
>
> I intentionally used elog() because I thought single user mode is not
> user-facing.
> But it is OK for me to use ereport() instead.
>
> > I'd suggest rewording the error message to provide some more context,
> > as well, say:
> > "cannot use %s in single-user mode", "function_name"
>
> Fixed. PSA new version

I'm curious about the scope of the restrictions we plan to add. For example,
the current patch does not include checks in the functions used for consuming
changes (such as pg_logical_slot_get_changes). Was this omission intentional?

Best Regards,
Hou zj



On Thu, Feb 20, 2025 at 4:26 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, February 20, 2025 10:23 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Michael,
> >
> > > I did not check how these call behave individually, just a few
> > > comments while putting my eyes on the patch.
> > >
> > > +   if (!IsUnderPostmaster)
> > > +           elog(ERROR,
> > > +                    "slot operation is prohibited in the single user
> > mode");
> > >
> > > elog() should not be used for failures that can be user-facing as this
> > > would not provide any translation.
> >
> > I intentionally used elog() because I thought single user mode is not
> > user-facing.
> > But it is OK for me to use ereport() instead.
> >
> > > I'd suggest rewording the error message to provide some more context,
> > > as well, say:
> > > "cannot use %s in single-user mode", "function_name"
> >
> > Fixed. PSA new version
>
> I'm curious about the scope of the restrictions we plan to add. For example,
> the current patch does not include checks in the functions used for consuming
> changes (such as pg_logical_slot_get_changes). Was this omission intentional?
>

Also, what about pg_replication_origin_* APIs? Do we want to restrict
those as well if we are restricting slot operations? I don't see any
solid theory presented in this thread on why we should add new checks
in multiple APIs restricting those in single-user mode.

--
With Regards,
Amit Kapila.



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit,

> Also, what about pg_replication_origin_* APIs? Do we want to restrict
> those as well if we are restricting slot operations? I don't see any
> solid theory presented in this thread on why we should add new checks
> in multiple APIs restricting those in single-user mode.

As David [1] and documentation [2] described, single user mode is typically used
for initialization, debugging and recovery purpose - not for normal purposes.
I think doing replication stuffs is not intended. Tom also considered like that [4].

The error I reported seemed to be introduced seven years ago (0ce5cf2), and no
one has reported till now. This also implies that there are no reasons to support
it.

Ideally, functions described in [5] can be banned in the single-user mode, except
the pg_drop_replication_slot(). Thought?

[1]: https://www.postgresql.org/message-id/CAKFQuwbnBkGZAM%2B5b1DWmbqU5W7b1r-nQsw87BukrUC5WLrJXg%40mail.gmail.com
[2]: https://www.postgresql.org/docs/devel/app-postgres.html
[3]: https://www.postgresql.org/message-id/3b2f809f-326c-38dd-7a9e-897f957a4eb1%40enterprisedb.com
[4]: https://www.postgresql.org/message-id/25024.1525789919%40sss.pgh.pa.us
[5]: https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION

Best regards,
Hayato Kuroda
FUJITSU LIMITED

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

Thanks everyone for giving comments! PSA new version.
What's new:

- Message format was modified to {"cannot use function %s in single-user mode", "function_name"}
- Reporting funcname was adjusted based on the parameters. ternary operator was used.
- Guard was added for functions in logicalfunction.c.

For now, functions for replication origin and replication messages were retained.
I can handle them after the discussion.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment
On Thu, Feb 20, 2025 at 6:21 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> Thanks everyone for giving comments! PSA new version.
> What's new:
>
> - Message format was modified to {"cannot use function %s in single-user mode", "function_name"}
> - Reporting funcname was adjusted based on the parameters. ternary operator was used.
> - Guard was added for functions in logicalfunction.c.
>

Shouldn't such a check be present in the CheckSlotPermissions() kind
of function to perform it in the central place?

> For now, functions for replication origin and replication messages were retained.
> I can handle them after the discussion.
>

Which other functions do we see similar restrictions? I checked
"sequence manipulation functions" (1), and "Transaction ID and
Snapshot Information Functions" (2) but couldn't see similar
restrictions.

(1) - https://www.postgresql.org/docs/current/functions-sequence.html
(2) - https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-SNAPSHOT

--
With Regards,
Amit Kapila.