Thread: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

Hi,

On Mon, Dec 30, 2024 at 10:44:38PM -0600, Masahiko Sawada wrote:
> Hi all,
> 
> Logical decoding (and logical replication) are available only when
> wal_level = logical. As the documentation says[1], Using the 'logical'
> level increases the WAL volume which could negatively affect the
> performance. For that reason, users might want to start with using
> 'replica', but when they want to use logical decoding they need a
> server restart to increase wal_level to 'logical'. My goal is to allow
> users who are using 'replica' level to use logical decoding without a
> server restart.

Thanks for starting that thread and +1 for the idea! 

> The first idea I came up with is to make the wal_level a PGC_SIGHUP
> parameter. However, it affects not only setting 'replica' to 'logical'
> but also setting 'minimal' to 'replica' or higher. I'm not sure the
> latter case is common and it might require a checkpoint. I don't want
> to make the patch complex for uncommon cases.
> 
> The second idea is to somehow allow both WAL-logging logical info and
> logical decoding even when wal_level is 'replica'. I've attached a PoC
> patch for that. The patch introduces new SQL functions such as
> pg_activate_logical_decoding() and pg_deactivate_logical_decoding().
> These functions are available only when wal_level is 'repilca'(or
> higher). In pg_activate_logical_decoding(), we set the status of
> logical decoding stored on the shared memory from 'disabled' to
> 'xlog-logical-info', allowing all processes to write logical
> information to WAL records for logical decoding. But the logical
> decoding is still not allowed. Once we confirm all in-progress
> transactions completed, we switch the status to
> 'logical-decoding-ready', meaning that users can create logical
> replication slots and use logical decoding.
> 
> Overall, with the patch, there are two ways to enable logical
> decoding: setting wal_level to 'logical' and calling
> pg_activate_logical_decoding() when wal_level is 'replica'. I left the
> 'logical' level for backward compatibility and for users who want to
> enable the logical decoding without calling that SQL function. If we
> can automatically enable the logical decoding when creating the first
> logical replication slot, probably we no longer need the 'logical'
> level. There is room to discuss the user interface. Feedback is very
> welcome.

If we don't want to force wal_level = logical to enable logical decoding (your
second idea) then I think that it would be better to "hide" everything behind
logical replication slot creation / deletion. That would mean that having at
least one logical replication slot created would be synonym to "activate logical
decoding" and zero logical replication slot created would be synonym to "deactivate
logical decoding".

That way:

1. an end user don't need to manipulate new functions and would just rely on
replication slots existence
2. we ensure that no extra WAL data is generated if not absolutely "needed"

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 3, 2025, at 10:14 AM, Bertrand Drouvot wrote:
If we don't want to force wal_level = logical to enable logical decoding (your
second idea) then I think that it would be better to "hide" everything behind
logical replication slot creation / deletion. That would mean that having at
least one logical replication slot created would be synonym to "activate logical
decoding" and zero logical replication slot created would be synonym to "deactivate
logical decoding".

I like this idea. The logical replication slot existence already has the
required protections and guarantees (no running transactions from the past while
creating) for logical decoding.

ERROR:  logical decoding requires "wal_level" >= "logical"
STATEMENT:  select pg_create_logical_replication_slot('test', 'pgoutput');

FATAL:  logical replication slot "test" exists, but "wal_level" < "logical"
HINT:  Change "wal_level" to be "logical" or higher.

Having said that, you are basically folding 'logical' machinery into 'replica'.
The 'logical' option can still exists but it will be less attractive because it
increases the WAL volume even if you are not using logical replication. I don't
know if the current 'logical' behavior (WAL records for logical decoding even
if there is no active logical replication) has any use case (maybe someone
inspects these extra records for analysis) but one suggestion (separate patch)
is to make 'logical' synonymous with the new 'replica' behavior (logical
decoding capability). This opens the door to remove 'logical' in future
releases (accepted as synonym but not documented).


--
Euler Taveira

On Fri, Jan 3, 2025 at 6:31 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Fri, Jan 3, 2025, at 10:14 AM, Bertrand Drouvot wrote:
>
> If we don't want to force wal_level = logical to enable logical decoding (your
> second idea) then I think that it would be better to "hide" everything behind
> logical replication slot creation / deletion. That would mean that having at
> least one logical replication slot created would be synonym to "activate logical
> decoding" and zero logical replication slot created would be synonym to "deactivate
> logical decoding".
>
>
> I like this idea. The logical replication slot existence already has the
> required protections and guarantees (no running transactions from the past while
> creating) for logical decoding.

I agree that it's better behavior.

>
> Having said that, you are basically folding 'logical' machinery into 'replica'.
> The 'logical' option can still exists but it will be less attractive because it
> increases the WAL volume even if you are not using logical replication. I don't
> know if the current 'logical' behavior (WAL records for logical decoding even
> if there is no active logical replication) has any use case (maybe someone
> inspects these extra records for analysis) but one suggestion (separate patch)
> is to make 'logical' synonymous with the new 'replica' behavior (logical
> decoding capability). This opens the door to remove 'logical' in future
> releases (accepted as synonym but not documented).

To enable the logical decoding automatically when the first slot is
created, probably we need to do something like:

1. enable WAL-logging logical info.
2. wait for all running transactions to finish.
3. enable logical decoding, and write a XLOG_LOGICAL_DECODING_STATUS record,
4. create a logical slot
  4-1. reserve the WAL and write an XLOG_RUNNING_XACTS record.
  4-2. wait for all transactions in the XLOG_RUNNING_XACTS record to
finish during creating the initial snapshot.

That is, we could need to wait for different sets of transactions to
finish (at 2 and 4-2). Which could surprise users since the first slot
creation could have users wait for a longer time. Using the 'logical'
level would avoid such waits.

Also, if we make the 'logical' level synonymous with the new 'replica'
behavior, we need to adjust regression tests as well. In some
regression tests (e.g., ondisk_startup.spec), we create a logical slot
while a transaction is concurrently running. We expect we wait for the
concurrent transaction during the slot creation but with the above
algorithm, we would wait for them to activate logical decoding. So I
agree that making the 'logical' level synonymous with the new
'replica' behavior is done in a separate patch. It would be a good
start to implement the new 'replica' level behavior.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Sat, Jan 4, 2025 at 6:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 3, 2025 at 6:31 AM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Fri, Jan 3, 2025, at 10:14 AM, Bertrand Drouvot wrote:
> >
> > If we don't want to force wal_level = logical to enable logical decoding (your
> > second idea) then I think that it would be better to "hide" everything behind
> > logical replication slot creation / deletion. That would mean that having at
> > least one logical replication slot created would be synonym to "activate logical
> > decoding" and zero logical replication slot created would be synonym to "deactivate
> > logical decoding".
> >
> >
> > I like this idea. The logical replication slot existence already has the
> > required protections and guarantees (no running transactions from the past while
> > creating) for logical decoding.
>
> I agree that it's better behavior.
>
> >
> > Having said that, you are basically folding 'logical' machinery into 'replica'.
> > The 'logical' option can still exists but it will be less attractive because it
> > increases the WAL volume even if you are not using logical replication. I don't
> > know if the current 'logical' behavior (WAL records for logical decoding even
> > if there is no active logical replication) has any use case (maybe someone
> > inspects these extra records for analysis) but one suggestion (separate patch)
> > is to make 'logical' synonymous with the new 'replica' behavior (logical
> > decoding capability). This opens the door to remove 'logical' in future
> > releases (accepted as synonym but not documented).
>
> To enable the logical decoding automatically when the first slot is
> created, probably we need to do something like:
>
> 1. enable WAL-logging logical info.
> 2. wait for all running transactions to finish.
> 3. enable logical decoding, and write a XLOG_LOGICAL_DECODING_STATUS record,
> 4. create a logical slot
>   4-1. reserve the WAL and write an XLOG_RUNNING_XACTS record.
>   4-2. wait for all transactions in the XLOG_RUNNING_XACTS record to
> finish during creating the initial snapshot.

A naive question: Can we combine step 2 and 4-2 - may be we need to
combine 4-1 and 3 too.

--
Best Wishes,
Ashutosh Bapat



On Mon, Jan 6, 2025 at 3:20 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sat, Jan 4, 2025 at 6:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jan 3, 2025 at 6:31 AM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > On Fri, Jan 3, 2025, at 10:14 AM, Bertrand Drouvot wrote:
> > >
> > > If we don't want to force wal_level = logical to enable logical decoding (your
> > > second idea) then I think that it would be better to "hide" everything behind
> > > logical replication slot creation / deletion. That would mean that having at
> > > least one logical replication slot created would be synonym to "activate logical
> > > decoding" and zero logical replication slot created would be synonym to "deactivate
> > > logical decoding".
> > >
> > >
> > > I like this idea. The logical replication slot existence already has the
> > > required protections and guarantees (no running transactions from the past while
> > > creating) for logical decoding.
> >
> > I agree that it's better behavior.
> >
> > >
> > > Having said that, you are basically folding 'logical' machinery into 'replica'.
> > > The 'logical' option can still exists but it will be less attractive because it
> > > increases the WAL volume even if you are not using logical replication. I don't
> > > know if the current 'logical' behavior (WAL records for logical decoding even
> > > if there is no active logical replication) has any use case (maybe someone
> > > inspects these extra records for analysis) but one suggestion (separate patch)
> > > is to make 'logical' synonymous with the new 'replica' behavior (logical
> > > decoding capability). This opens the door to remove 'logical' in future
> > > releases (accepted as synonym but not documented).

FYI one possible use case of the 'logical' level would be that if we
support retrieving WAL records for logical decoding from somewhere
like restore_command in the future, users would like to keep the
wal_level 'logical' or keep the logical decoding active.

> >
> > To enable the logical decoding automatically when the first slot is
> > created, probably we need to do something like:
> >
> > 1. enable WAL-logging logical info.
> > 2. wait for all running transactions to finish.
> > 3. enable logical decoding, and write a XLOG_LOGICAL_DECODING_STATUS record,
> > 4. create a logical slot
> >   4-1. reserve the WAL and write an XLOG_RUNNING_XACTS record.
> >   4-2. wait for all transactions in the XLOG_RUNNING_XACTS record to
> > finish during creating the initial snapshot.
>
> A naive question: Can we combine step 2 and 4-2 - may be we need to
> combine 4-1 and 3 too.

Yeah, maybe we can somewhat optimize this process. Probably we can do like:

1. create a logical slot.
2. enable WAL-logging logical info.
3. reserve the WAL and write an XLOG_RUNNING_XACTS record.
4. wait for all transactions in the XLOG_RUNNING_XACTS record to
finish during creating the initial snapshot.
  4-1. write a XLOG_LOGICAL_DECODING_STATUS record if a process
detects all transactions in the XLOG_RUNNING_XACTS record to finish.

However, we need to note that 4-1 should be done by only one process.
Also, it's a bit weird and concerns to me that the snapbuild
initialization contains the process of activating logical decoding.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Hi Sawada-San.

FWIW, I also thought it was a good idea suggested by Bertrand [1] to
"hide" everything behind the slot create/delete, and thereby eliminate
the need for user intervention using those new
pg_activate/deactivate_logical_decoding functions.

But, one concern doing it this way is how to prevent a user
(accidentally?) getting themselves into a different replication mode
without realising it? Or say they change the mode and then "forget"
that they did that.

For example, If wal_level=replica and then somebody does CREATE
PUBLICATION/SUBSCRIPTION won't that implicitly enable the logical
decoding and then leave it enabled until all the logical replication
slots are eventually dropped? Now, when the user examines wal_level it
is still going to show 'replica', which could be misleading --- e.g.
how will they even know that they can't really trust that anymore, and
they must also check the pg_get_logical_decoding_status?

At least, there should be some big NOTICE/WARNING logs written
whenever the logical decoding status is getting implicitly changed by
logical replication slot create/delete..

~~~

Meanwhile, although it is premature to give detailed code review
comments for what is still a PoC, during my reading of your patch I
made copious notes for myself, so I thought I might as well post what
I have regardless. See the details below in this post. Ignore, or do
with them as you wish. But, hopefully, a lot of the following feedback
will still be relevant regardless of any design changes.

//////////

Commit message

1.
Patch description is missing

======
GENERAL

2. git apply warning

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/PoC_online_activate_logical_decoding.patch
../patches_misc/PoC_online_activate_logical_decoding.patch:93: new
blank line at EOF.
+
warning: 1 line adds whitespace errors.

~~~

3. Copyright date

All the new files of this patch have the wrong copyright date (2024
instead of 2025).

~~~

4. Missing documentation

There are missing docs changes, but that is to be expected while the
design is in flux.

~~~

5. Terminology

There are multiple places where similar terms like "active" and
"enabled" and "ready" are being used apparently interchangeably making
it quite confusing. Perhaps it was your intention to convey the idea
that the logical mode might be "enabled" but is not yet "active", but
I didn't think that was the case. Mostly it just seemed arbitrary to
me.

IMO the state starting out as "disabled" should end up as "enabled"
(not "ready").

Meanwhile, the API to change the status is currently called
"activate/deactivate" instead of "enable/disable" (??)

Meanwhile, many/most of the error messages talk about "enabled" (not
"activated" or "ready"). e.g. "logical decoding needs to be enabled to
publish logical changes"

Meanwhile, macros are called 'IsLogicalDecodingActive'; why isn't that
called 'IsLogicalDecodingEnabled'.

~

OTOH, maybe you chose the term "active" because master already had the
macro XLogLogicalInfoActive(). But in that case maybe it should be
saying inactive/active everywhere instead of disabled/enabled/ready...

~~~

6. Typedefs

There are some new typedefs introduced in the new files, So these
should be updated in the typedefs.list file.

~~~

7. Tri-state logical decoding status

Currently there is a tri-state logical decoding status: "disabled" ->
"xlog-logicalinfo" -> "ready"

AFAICT that middle state is just transitory. It is easier for me to
visualize it like:
logical decoding status: "disabled" -> "pending" -> "enabled".

IIUC the "pending" part is when it is just waiting for any running
transaction to complete.

IMO it becomes simpler if you do not make any distinction between WAL
logging and logical decoding ability. E.g. Do not allow logical WAL
logging for that 'middle' transitory state; only allow it after
logical decoding becomes fully activated.

IIUC, it would just a small change to do this:

CURRENT  (pseudocode):
XLogLogicalInfoEnabled(void) { return XLogLogicalInfoCtl->status  >=
LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO; }

SUGGESTION (pseudocode)
XLogLogicalInfoEnabled(void) { return XLogLogicalInfoCtl->status ==
LOGICAL_DECODING_STATUS_READY; }

======
contrib/test_decoding/t/002_online_activation.pl

8.
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group

/2024/2025/

~~~

9.
+# check if a logical replication slot cannot be craeted while logical

/check/Check/

/cannot/can/

/craeted/created/

~~~

10.
+ok( $stderr =~ /ERROR:  logical decoding requires "wal_level" >= "logical"/,
+    "logical replication slot cannot be created while logical
decoding is disableed");

/disableed/disabled/

~~~

11.
+# Activate logical info logging.
+$node->safe_psql('postgres',
+ 'SELECT pg_activate_logical_decoding()');
+$ret = $node->safe_psql('postgres', 'SELECT pg_get_logical_decoding_status()');
+is($ret, 'ready', 'logical decoding gets activated');

Perhaps there should be another "show wal_level" done here to show
that even after logical decoding is enabled, the wal_level remains the
same.

~~~

12.
+$node->safe_psql(
+    'postgres',
+    q[
+CREATE TABLE test (a int primary key, b int);
+INSERT INTO test values (1, 100), (2, 200);
+UPDATE test SET b = b + 10 WHERE a = 1;
+    ]);
+
+$ret = $node->safe_psql(
+    'postgres',
+    q[SELECT data FROM
pg_logical_slot_get_changes('regression_slot1', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1');]
+    );
+is($ret, 'BEGIN
+table public.test: INSERT: a[integer]:1 b[integer]:100
+table public.test: INSERT: a[integer]:2 b[integer]:200
+COMMIT
+BEGIN
+table public.test: UPDATE: a[integer]:1 b[integer]:110
+COMMIT', 'logical decoding works fine');
+

There seems to be some comment missing for all this part. I expected
it to say something like "Check that logical decoding is working OK".

======
src/backend/access/heap/heapam.c

13.
- * This is only used in wal_level >= WAL_LEVEL_LOGICAL, and only for catalog
- * tuples.
+ * This is only used when XLogLogicalInfoActive() is true, and only for
+ * catalog tuples.

I felt it would be better to Assert this in the code, instead of just
mentioning it in the function comment.

======
src/backend/access/rmgrdesc/xlogdesc.c

14.
+ else if (info == XLOG_LOGICAL_DECODING_STATUS)
+ {
+ bool logical_decoding;
+
+ memcpy(&logical_decoding, rec, sizeof(bool));
+ appendStringInfoString(buf, logical_decoding ? "true" : "false");
+ }

Should there be some textual context for this bool value, like wal_level has?

SUGGESTION
appendStringInfoString(buf, "logical decoding %s", logical_decoding ?
"true" : "false");

~~~

15.
I noticed there are lots of other descriptions in this function that
are showing the value of 'wal_level' via the static function call:
get_wal_level_string(wal_level).

But, since the wal_level now might be out-of-step with the
logcal_decoding boolean value, shouldn't you be showing the logical
decoding state everywhere too?

e.g. you might have a combination like:
wal_level replica
logical decoding true

======
src/backend/access/transam/xlog.c

BootStrapXLOG:

16.
+ checkPoint.logicalDecoding = false;
  checkPoint.wal_level = wal_level;

Shouldn't the default for 'logicalDecoding' be consistent with the
assigned wal_level? It seems strange if it is possible to have a
situation where wal_level = logical but simultaneously logicalDecoding
= false.

~~~

xlog_redo:

17.
+ else if (info == XLOG_LOGICAL_DECODING_STATUS)
+ {
+ bool enabled;

A less generic variable name like "logical_decoding_enabled" might be better.

======
src/backend/commands/publicationcmds.c

CreatePublication:

18.
- if (wal_level != WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
  ereport(WARNING,
  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("\"wal_level\" is insufficient to publish logical changes"),
- errhint("Set \"wal_level\" to \"logical\" before creating subscriptions.")));
+ errmsg("logical decoding needs to be enabled to publish logical changes"),
+ errhint("Set \"wal_level\" to \"logical\" or activate logical
decoding before creating subscriptions.")));

18a.
/needs to be enabled/must be enabled/

~

18b.
The hint seems less useful than it could be because it does not say
*how* the user can activate logical decoding.

SUGGESTION
Set \"wal_level\" to \"logical\" or use pg_activate_logical_decoding
before creating subscriptions.

======
src/backend/replication/logical/logical.c

CheckLogicalDecodingRequirements:

19.
  ereport(ERROR,
  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("logical decoding on standby requires \"wal_level\" >=
\"logical\" on the primary")));
+ errmsg("logical decoding on standby requires to enable logical
decoding on the primary")));
19a
/requires to enable logical decoding/requires logical decoding to be enabled/

~

19b.
This message is like what was in CreatePublication, so might be better
split into an errmsg and errhint.

e.g.
errmsg("logical decoding on standby requires logical decoding to be
enabled on the primary");
errhint("Set \"wal_level\" to \"logical\" or use
pg_activate_logical_decoding.");

======
src/backend/replication/logical/logicalxlog.c

20. General

Would it be better if all the exposed functions from here used the
common "LogicalXLog" prefix?

These ones do:
- LogicalXlogShmemSize
- LogicalXlogShmemInit

These ones don't:
- StartupLogicalDecodingStatus
- UpdateLogicalDecodingStatus
- XLogLogicalInfoEnabled (this one is backwards)
- IsLogicalDecodingActive

~~~

21.
+ * logicalxlog.c
+ *    This module contains the codes for enabling or disabling to include
+ *    logical information into WAL records and logical decoding.

/codes/code/

Also, this is worded as if you can set those states independently, but
AFAICT the "logical information into WAL" is just a transition-state
to the LOGICAL_DECODING_STATUS_READY.

SUGGESTION
This module contains code to enable/disable and fetch the logical
decoding status.

~~~

22.
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group

/2024/2025/

~~~

23.
+XLogLogicalInfoCtlData *XLogLogicalInfoCtl = NULL;

Add blank line above this.

~~~

LogicalXlogShmemSize:

24.
+/*
+ * Initialization of shared memory.
+ */
+
+Size
+LogicalXlogShmemSize(void)
+{
+ return sizeof(XLogLogicalInfoCtlData);
+}

Unnecessary blank line after the function comment.

~~~

LogicalXlogShmemInit:

25.
+void
+LogicalXlogShmemInit(void)

Missing function comment.

~~~

StartupLogicalDecodingStatus:

26.
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * before calling StartupReplicationSlots().
+ */
+void
+StartupLogicalDecodingStatus(bool enabled_at_last_checkpoint)

The function comment does not actually say what this function does.
Maybe add a first sentence like:
"Assign the initial LogicalDecodingStatus value."

~~~

27.
+ /*
+ * On standbys, we always start with the status in the last checkpoint
+ * record. If changes of wal_level or logical decoding status is sent from
+ * the primary, we will enable or disable the logical decoding while
+ * replaying the WAL record and invalidate slots if necessary.
+ */

/is sent from/are sent from/

/invalidate/invalidating/

~~~

28.
+ if (!StandbyMode)
+ {
+ /*
+ * Disable logical decoding if replication slots are not available.
+ */
+ if (max_replication_slots == 0)
+ status = LOGICAL_DECODING_STATUS_DISABLED;
+
+ /*
+ * If previously the logical decoding was active but the server
+ * restarted with wal_level < 'replica', we disable the logical
+ * decoding.
+ */
+ if (wal_level < WAL_LEVEL_REPLICA)
+ status = LOGICAL_DECODING_STATUS_DISABLED;
+
+ /*
+ * Setting wal_level to 'logical' immediately enables logical decoding
+ * and WAL-logging logical info.
+ */
+ if (wal_level >= WAL_LEVEL_LOGICAL)
+ status = LOGICAL_DECODING_STATUS_READY;
+ }

28a.
The comment is a bit confusing because it is only talking about
Standby, whereas the code block is all for !StandbyMode.

It would be more easy to read if refactored to put the comment in an
empty code block like below:

if (StandByMode)
{
  /*
   * On standbys, we always start with the status in the last checkpoint
   * record...
   * ...
   */
}
else
{
<code>
}

~

28b.
What if the first condition (max_replication_slots is 0) and last
condition (WAL_LEVEL_LOGICAL) are true at the same time? Currently
this initializes the status as READY. Is it OK?

~~~

UpdateLogicalDecodingStatus:

29.
+void
+UpdateLogicalDecodingStatus(bool activate)
+{
+ XLogLogicalInfoCtl->status = activate
+ ? LOGICAL_DECODING_STATUS_READY
+ : LOGICAL_DECODING_STATUS_DISABLED;
+}

29a.
Missing function comment.

~

29b.
IMO the parameter should be in past tense, but I also think should be
called "enabled" (same as what it was called in the extern in
logicalxlog.h)

~~~

XLogLogicalInfoEnabled:

30.
+/*
+ * Is WAL-Logging logical info enabled?
+ */
+bool inline
+XLogLogicalInfoEnabled(void)
+{
+ /*
+ * Use volatile pointer to make sure we make a fresh read of the shared
+ * variable.
+ */
+ volatile XLogLogicalInfoCtlData *ctl = XLogLogicalInfoCtl;
+
+ return ctl->status >= LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO;
+}

How necessary is this function?

I thought that LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO is more like
just a transitory state between LOGICAL_DECODING_STATUS_DISABLED and
LOGICAL_DECODING_STATUS_READY. So in that case, isn't it simpler just
to wait for the READY? e.g. remove this function and just use
IsLogicalDecodingActive() instead?

~~~

do_activate_logical_decoding:

31.
Missing function comment.

~~~

32.
+ /*
+ * Get the latest status of logical info logging. If it's already
+ * activated, quick return.
+ */
+ SpinLockAcquire(&XLogLogicalInfoCtl->mutex);
+ if (XLogLogicalInfoCtl->status >= LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO)
+ {
+ SpinLockRelease(&XLogLogicalInfoCtl->mutex);
+ return;
+ }

32a.
I didn't think the comment should mention "logical info logging" here.

SUGGESTION
Quick return if logical decoding is already activated (or activation
was already requested and is pending).

~

32b.
Related to this, it seems simpler to check condition "if
(XLogLogicalInfoCtl->status > LOGICAL_DECODING_STATUS_DISABLED)"

~~~

33.
+ PG_ENSURE_ERROR_CLEANUP(logical_decoding_activation_abort_callback, 0);
+ {
...
+ }
+ PG_END_ENSURE_ERROR_CLEANUP(logical_decoding_activation_abort_callback, 0);

IIUC,  this whole chunk of code lies between the status of
"xlog-logicalinfo" and "ready".

I think it needs a comment something like:
/* Confirm that all in-progress transactions have completed before
logical decoding can be enabled. */

~~~

logical_decoding_activation_abort_callback:

34.
Missing function comment.

~~~

35.
Since this is a callback for the function
do_activate_logical_decoding(), I thought the function should be
slightly renamed to match the caller.

/logical_decoding_activation_abort_callback/activate_logical_decoding_abort_callback/

~~~

36.
IIUC, the status could have reached
LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO before we encountered some
problem trying to get to the READY status, and so reset everything
back to DISABLED. So some (unusable?) logical WAL records could have
been written while in that intermediate state? It is another reason
why I was unsure about the need for the earlier XLogLogicalInfoEnabled
function.

Won't it be simpler to just have a DISABLED and ENABLED state and just
keep the transition phase for internal logic?  IOW, make
XLogLogicalInfoEnabled() return true only when READY/ENABLED.

~~~
pg_activate_logical_decoding:

37.
Missing function comment.

~~~

38.
+ if (IsTransactionState() &&
+ GetTopTransactionIdIfAny() != InvalidTransactionId)
+ ereport(ERROR,
+ (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+ errmsg("cannot activate logical decoding in transaction that has
performed writes")));
+
+

Double blank lines.

~~~

39.
+ SpinLockAcquire(&(XLogLogicalInfoCtl->mutex));
+ status = XLogLogicalInfoCtl->status;
+ SpinLockRelease(&(XLogLogicalInfoCtl->mutex));
+
+ if (status == LOGICAL_DECODING_STATUS_READY)
+ PG_RETURN_VOID();

Should have some comment like: "If logical decoding is already enabled
then there is nothing to do."

OTOH, can't all this code be removed, since the top-of-loop code will
do the same thing anyway (I didn't think the extra sleep
prepare/cancel that it is doing was something to worry about).

~~~

40.
+ ConditionVariableSleep(&XLogLogicalInfoCtl->cv,
+    WAIT_EVENT_LOGICAL_DECODING_ACTIVATION);
+
+ continue;

Remove the blank line.

~~~

41.
+ /* Activate logical decoding */
+ do_activate_logical_decoding();

The comment is not helpful because it is practically the same as the
function name.

~~~

pg_deactivate_logical_decoding:

42.
Missing function comment.

~~~

43.
+ /*
+ * Hold ReplicationSlotAllocationLock to prevent slots from newly
+ * being created while deactivating the logical decoding.
+ */
+ LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);

/newly being created/being created/

~~~

44.
+ for (int i = 0; i < max_replication_slots; i++)
+ {
+ ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
+
+ if (!s->in_use)
+ continue;
+
+ if (SlotIsPhysical(s))
+ continue;
+
+ if (s->data.invalidated != RS_INVAL_NONE)
+ continue;
+
+ valid_logical_slots++;
+ }

IMO it is more natural here to check "if (!SlotIsLogical(s))" instead
of "if (SlotIsPhysical(s))".

~~~

45.
+ if (valid_logical_slots > 0)
+ ereport(ERROR,
+ (errmsg("cannot deactivate logical decoding while having valid
logical replication slots")));

/while having valid logical replication slots/while valid logical
replication slots are present/

~~~

46.
+ if (XLogStandbyInfoActive() && !recoveryInProgress)
+ {
+ bool enabled = false;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) (&enabled), sizeof(bool));
+
+ XLogInsert(RM_XLOG_ID, XLOG_LOGICAL_DECODING_STATUS);
+ }
+
+ SpinLockAcquire(&(XLogLogicalInfoCtl->mutex));
+ XLogLogicalInfoCtl->status = LOGICAL_DECODING_STATUS_DISABLED;
+ SpinLockRelease(&(XLogLogicalInfoCtl->mutex));

In the pg_activate_logical_decoding function, the assignment of the
status and the writing of the XLog record was done within a critical
section. Why isn't the code doing the same here?

~~~

pg_get_logical_decoding_status:

47.
Missing function comment.

======
src/backend/replication/logical/slotsync.c

ValidateSlotSyncParams:

48.
- if (wal_level < WAL_LEVEL_LOGICAL)
+ if (!XLogLogicalInfoActive())
  ereport(ERROR,
  errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("replication slot synchronization requires \"wal_level\" >=
\"logical\""));
+ errmsg("replication slot synchronization requires \"wal_level\" >=
\"logical\" or to activate logical decoding"));


This is another message like what was in CreatePublication, and that
might be better split into an errmsg and errhint.

e.g.
errmsg("replication slot synchronization requires logical decoding to
be enabled");
errhint("Set \"wal_level\" to \"logical\" or use
pg_activate_logical_decoding.");

======
src/backend/replication/slot.c

RestoreSlotFromDisk:

49.
+ if (cp.slotdata.database != InvalidOid && !XLogLogicalInfoActive())
  ereport(FATAL,
  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  errmsg("logical replication slot \"%s\" exists, but \"wal_level\" <
\"logical\"",
  NameStr(cp.slotdata.name)),
- errhint("Change \"wal_level\" to be \"logical\" or higher.")));
+ errhint("Change \"wal_level\" to be \"logical\" or higher, or
activate logical decoding with \"replica\".")));

Another message that could be made more consistent with the similar
ones from CreatePublication, ValidateSlotSyncParams, etc...

e.g.
errmsg("logical replication slot \"%s\" exists, but logical decoding
is not enabled");
errhint("Set \"wal_level\" to \"logical\" or use
pg_activate_logical_decoding.");

======
src/backend/utils/activity/wait_event_names.txt

50.
+LOGICAL_DECODING_ACTIVATION "Waiting for logical decoding to be activated."

Should this be more like "Waiting for a requested logical decoding
activation to complete."

======
src/include/access/xlog.h

51.
-#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
+#define XLogLogicalInfoActive() (XLogLogicalInfoEnabled())

Those extra parentheses seem redundant now.

======
src/include/replication/logicalxlog.h

52.
+/*-------------------------------------------------------------------------
+ * logicalxlog.h
+ *   Exports from replication/logical/logicalxlog.c.
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ *-------------------------------------------------------------------------

/2024/2025/

~~~

53.
+typedef enum LogicalDecodingStatus
+{
+ LOGICAL_DECODING_STATUS_DISABLED = 0,
+ LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO,
+ LOGICAL_DECODING_STATUS_READY,
+} LogicalDecodingStatus;

Maybe some more comments about these status phases would be helpful.
Specifically, it's not immediately clear if the
LOGICAL_DECODING_STATUS_XLOG_LOGICALINFO is a real state we could find
ourselves permanently, or if it is more like just a temporary
"pending" state while transitioning from DISABLED to READY.

AFAICT it is just transitory, so in that case, maybe
LOGICAL_DECODING_PENDING might be a more appropriate name.

~~~

54.
+typedef struct XLogLogicalInfoCtlData XLogLogicalInfoCtlData;
+extern XLogLogicalInfoCtlData *XLogLogicalInfoCtl;
+extern bool XLogLogicalInfo;

Add a blank line between the typedef and the externs.

======
[1] Bertrand's hide everything idea -
https://www.postgresql.org/message-id/Z3fimYj0fbkLmWJb%40ip-10-97-1-34.eu-west-3.compute.internal

Kind Regards,
Peter Smith.
Fujitsu Australia