Thread: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

On Tue, Dec 31, 2024 at 10:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi all,
>
> Logical decoding (and logical replication) are available only when
> wal_level = logical. As the documentation says[1], Using the 'logical'
> level increases the WAL volume which could negatively affect the
> performance. For that reason, users might want to start with using
> 'replica', but when they want to use logical decoding they need a
> server restart to increase wal_level to 'logical'. My goal is to allow
> users who are using 'replica' level to use logical decoding without a
> server restart. There are other GUC parameters related to logical
> decoding and logical replication such as max_wal_senders,
> max_logical_replication_workers, and max_replication_slots, but even
> if users set these parameters >0, there would not be a noticeable
> performance impact. And their default values are already >0. So I'd
> like to focus on making only the wal_level dynamic GUC parameter.
> There are several earlier discussions[2][3] but no one has submitted
> patches unless I'm missing something.
>
> The first idea I came up with is to make the wal_level a PGC_SIGHUP
> parameter. However, it affects not only setting 'replica' to 'logical'
> but also setting 'minimal' to 'replica' or higher. I'm not sure the
> latter case is common and it might require a checkpoint. I don't want
> to make the patch complex for uncommon cases.
>
> The second idea is to somehow allow both WAL-logging logical info and
> logical decoding even when wal_level is 'replica'. I've attached a PoC
> patch for that. The patch introduces new SQL functions such as
> pg_activate_logical_decoding() and pg_deactivate_logical_decoding().
> These functions are available only when wal_level is 'repilca'(or
> higher). In pg_activate_logical_decoding(), we set the status of
> logical decoding stored on the shared memory from 'disabled' to
> 'xlog-logical-info', allowing all processes to write logical
> information to WAL records for logical decoding. But the logical
> decoding is still not allowed. Once we confirm all in-progress
> transactions completed, we switch the status to
> 'logical-decoding-ready', meaning that users can create logical
> replication slots and use logical decoding.
>
> Overall, with the patch, there are two ways to enable logical
> decoding: setting wal_level to 'logical' and calling
> pg_activate_logical_decoding() when wal_level is 'replica'. I left the
> 'logical' level for backward compatibility and for users who want to
> enable the logical decoding without calling that SQL function. If we
> can automatically enable the logical decoding when creating the first
> logical replication slot, probably we no longer need the 'logical'
> level. There is room to discuss the user interface. Feedback is very
> welcome.
>

If a server is running at minimal wal_level and they want to enable
logical replication, they would still need a server restart. That
would be rare but not completely absent.

Our documentation says "wal_level determines how much information is
written to the WAL.". Users would may not expect that the WAL amount
changes while wal_level = replica depending upon whether logical
decoding is possible. It may be possible to set the expectations right
by changing the documentation. It's not in the patch, so I am not sure
whether this is considered.

Cloud providers do not like multiple ways of changing configuration
esp. when they can not control it. See [1]. Changing wal_level through
a SQL function may fit the same category.

I agree that it would be a lot of work to make all combinations of
wal_level changes work, but changing wal_level through SIGHUP looks
like a cleaner solution. Is there way that we make the GUC SIGHUP but
disallow certain combinations of old and new values?

[1]
https://www.postgresql.org/message-id/flat/CA%2BVUV5rEKt2%2BCdC_KUaPoihMu%2Bi5ChT4WVNTr4CD5-xXZUfuQw%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat



RE: POC: enable logical decoding when wal_level = 'replica' without a server restart

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Sawada-san,

I love the idea. I've roughly tested the patch and worked on my env.
Here are initial comments...

1. xloglevelworker.c
```
+#include "replication/logicalxlog.h"
```

xloglevelworker.c includes replication/logicalxlog.h, but it does not exist.
The line had to be removed to build and test it.

2.
```
+static void
+writeUpdateWalLevel(int new_wal_level)
+{
+       XLogBeginInsert();
+       XLogRegisterData((char *) (&new_wal_level), sizeof(bool));
+       XLogInsert(RM_XLOG_ID, XLOG_UPDATE_WAL_LEVEL);
+}
```

IIUC the data length should be sizeof(int) instead of sizeof(bool).

3.
Is there a reason why the process does not wait till the archiver exits?

4.
When I dumped wal files, I found that XLOG_UPDATE_WAL_LEVEL cannot be recognized:

```
rmgr: XLOG        len (rec/tot):     27/    27, tx:          0, lsn: 0/03050838, prev 0/03050800, desc: UNKNOWN (f0)
wal_levellogical
 
```

xlog_identify() must be updated as well.

5.
When I changed "logical" to "replica", postgres outputs like below:

```
LOG:  received SIGHUP, reloading configuration files
LOG:  parameter "wal_level" changed to "replica"
LOG:  wal_level control worker started
LOG:  changing wal_level from "logical" to "replica"
LOG:  wal_level has been decreased to "replica"
LOG:  successfully changed wal_level from "logical" to "replica"
```

ISTM that both postmaster and the wal_level control worker said something like
"wal_level changed", which is bit strange for me. Since GUC can't be renamed,
can we use another name for the wal_level control state?

6.
With the patch present, the wal_level can be changed to the minimal even when the
streaming replication is going. If we do that, the walsender exits immediately and
the below FATAL appears periodically until the standby stops. Same things can be
said for the logical replication:

```
FATAL:  streaming replication receiver "walreceiver" could not connect to the primary server:
connection to server on socket "/tmp/.s.PGSQL.oooo" failed:
FATAL:  WAL senders require "wal_level" to be "replica" or "logical
```

I know this is not a perfect, but can we avoid the issue by reject the GUC update
if the walsender exists? Another approach is not to update the value when replication
slots need to be invalidated.

----------
Best regards,
Haato Kuroda


On Tue, Jan 28, 2025 at 1:39 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> I love the idea. I've roughly tested the patch and worked on my env.
> Here are initial comments...

Thank you for looking at the patch!

>
> 1. xloglevelworker.c
> ```
> +#include "replication/logicalxlog.h"
> ```
>
> xloglevelworker.c includes replication/logicalxlog.h, but it does not exist.
> The line had to be removed to build and test it.
>
> 2.
> ```
> +static void
> +writeUpdateWalLevel(int new_wal_level)
> +{
> +       XLogBeginInsert();
> +       XLogRegisterData((char *) (&new_wal_level), sizeof(bool));
> +       XLogInsert(RM_XLOG_ID, XLOG_UPDATE_WAL_LEVEL);
> +}
> ```
>
> IIUC the data length should be sizeof(int) instead of sizeof(bool).

Agreed to fix them.

>
> 3.
> Is there a reason why the process does not wait till the archiver exits?

No. I didn't implement this part as the patch was just for
proof-of-concept. I think it would be better to wait for it to exit.

>
> 4.
> When I dumped wal files, I found that XLOG_UPDATE_WAL_LEVEL cannot be recognized:
>
> ```
> rmgr: XLOG        len (rec/tot):     27/    27, tx:          0, lsn: 0/03050838, prev 0/03050800, desc: UNKNOWN (f0)
wal_levellogical 
> ```
>
> xlog_identify() must be updated as well.

Will fix.

>
> 5.
> When I changed "logical" to "replica", postgres outputs like below:
>
> ```
> LOG:  received SIGHUP, reloading configuration files
> LOG:  parameter "wal_level" changed to "replica"
> LOG:  wal_level control worker started
> LOG:  changing wal_level from "logical" to "replica"
> LOG:  wal_level has been decreased to "replica"
> LOG:  successfully changed wal_level from "logical" to "replica"
> ```
>
> ISTM that both postmaster and the wal_level control worker said something like
> "wal_level changed", which is bit strange for me. Since GUC can't be renamed,
> can we use another name for the wal_level control state?

I'm concerned that users could be confused if two different names
refer to substantially the same thing.

Having said that, I guess that we need to drastically change the
messages. For example, I think that the wal_level worker should say
something like "successfully made 'logical' wal_level effective"
instead of saying something like "changed wal_level value". Also,
users might not need gradual messages when increasing 'minimal' to
'logical' or decreasing 'logical' to 'minimal'.

>
> 6.
> With the patch present, the wal_level can be changed to the minimal even when the
> streaming replication is going. If we do that, the walsender exits immediately and
> the below FATAL appears periodically until the standby stops. Same things can be
> said for the logical replication:
>
> ```
> FATAL:  streaming replication receiver "walreceiver" could not connect to the primary server:
> connection to server on socket "/tmp/.s.PGSQL.oooo" failed:
> FATAL:  WAL senders require "wal_level" to be "replica" or "logical
> ```
>
> I know this is not a perfect, but can we avoid the issue by reject the GUC update
> if the walsender exists? Another approach is not to update the value when replication
> slots need to be invalidated.

Does it mean that we reject the config file from being reloaded in
that case? I have no idea how to reject it in a case where the
wal_level in postgresql.conf changed and the user did 'pg_ctl reload'.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Mon, Feb 3, 2025 at 3:40 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > I'm concerned that users could be confused if two different names
> > refer to substantially the same thing.
> >
> > Having said that, I guess that we need to drastically change the
> > messages. For example, I think that the wal_level worker should say
> > something like "successfully made 'logical' wal_level effective"
> > instead of saying something like "changed wal_level value". Also,
> > users might not need gradual messages when increasing 'minimal' to
> > 'logical' or decreasing 'logical' to 'minimal'.
>
> +1 for something like "successfully made 'logical' wal_level effective", and
> removing gradual messages.
>
> > > 6.
> > > With the patch present, the wal_level can be changed to the minimal even when
> > the
> > > streaming replication is going. If we do that, the walsender exits immediately
> > and
> > > the below FATAL appears periodically until the standby stops. Same things can
> > be
> > > said for the logical replication:
> > >
> > > ```
> > > FATAL:  streaming replication receiver "walreceiver" could not connect to the
> > primary server:
> > > connection to server on socket "/tmp/.s.PGSQL.oooo" failed:
> > > FATAL:  WAL senders require "wal_level" to be "replica" or "logical
> > > ```
> > >
> > > I know this is not a perfect, but can we avoid the issue by reject the GUC update
> > > if the walsender exists? Another approach is not to update the value when
> > replication
> > > slots need to be invalidated.
> >
> > Does it mean that we reject the config file from being reloaded in
> > that case? I have no idea how to reject it in a case where the
> > wal_level in postgresql.conf changed and the user did 'pg_ctl reload'.
>
> I imagined like attached. When I modified wal_level to minimal and send SIGHUP,
> postmaster reported below lines and failed to update wal_level.
>
> ```
> LOG:  received SIGHUP, reloading configuration files
> LOG:  wal_level cannot be set to "minimal" while walsender exists
> LOG:  configuration file "...postgresql.conf" contains errors; unaffected changes were applied
> ```

Interesting, and thanks for sharing the patch. But I think that when
we change the wal_level to 'minimal', there is a window where a new
walsender can launch after passing the check_wal_level() check.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Hi,

On Tue, Feb 11, 2025 at 02:11:10PM -0800, Masahiko Sawada wrote:
> I've updated the patch that includes comment updates and bug fixes.

Thanks!

> The main idea of changing WAL level online is to decouple two aspects:
> (1) the information included in WAL records and (2) the
> functionalities available at each WAL level. With that, we can change
> the WAL level gradually. For example, when increasing the WAL level
> from 'replica' to 'logical', we first switch the WAL level on the
> shared memory to a new higher level where we allow processes to write
> WAL records with additional information required by the logical
> decoding, while keeping the logical decoding unavailable. The new
> level is something between 'replica' and 'logical'. Once we confirm
> all processes have synchronized to the new level, we increase the WAL
> level further to 'logical', allowing us to start logical decoding. The
> patch supports all combinations of WAL level transitions. It makes
> sense to me to use a background worker to proceed with this transition
> work since we need to wait at some points, rather than delegating it
> to the checkpointer process.

The background worker being added is "wal_level control worker". I wonder if
it would make sense to create a more "generic" one instead (to whom we could 
assign more "tasks" later on, as suggested in the past in [1]).

+   /*
+    * XXX: Perhaps it's not okay that we failed to launch a bgworker and give
+    * up wal_level change because we already reported that the change has
+    * been accepted. Do we need to use aux process instead for that purpose?
+    */
+   if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+       ereport(WARNING,
+               (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+                errmsg("out of background worker slots"),
+                errhint("You might need to increase \"%s\".", "max_worker_processes")));

Not sure it has to be an aux process instead as it should be busy in rare occasions.

Maybe we could add some mechanism for ensuring that a bgworker slot is available
when needed (as suggested in [2])?

Not saying it has to be done that way. I just thought that the "wal_level control worker"
could be a perfect use case/starting point for a more generic one but I don't want
to over complicate that thread though.

So maybe just rename "wal_level control worker" to say "custodian worker" and
we could also think about [2]? Feel free to consider all of this as Nits if you
feel it deviates too much from the initial intend of this thread.

[1]: https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com
[2]: https://www.postgresql.org/message-id/1058306.1680467858%40sss.pgh.pa.us

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Tue, Feb 11, 2025 at 11:44 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Tue, Feb 11, 2025 at 02:11:10PM -0800, Masahiko Sawada wrote:
> > I've updated the patch that includes comment updates and bug fixes.
>
> Thanks!
>
> > The main idea of changing WAL level online is to decouple two aspects:
> > (1) the information included in WAL records and (2) the
> > functionalities available at each WAL level. With that, we can change
> > the WAL level gradually. For example, when increasing the WAL level
> > from 'replica' to 'logical', we first switch the WAL level on the
> > shared memory to a new higher level where we allow processes to write
> > WAL records with additional information required by the logical
> > decoding, while keeping the logical decoding unavailable. The new
> > level is something between 'replica' and 'logical'. Once we confirm
> > all processes have synchronized to the new level, we increase the WAL
> > level further to 'logical', allowing us to start logical decoding. The
> > patch supports all combinations of WAL level transitions. It makes
> > sense to me to use a background worker to proceed with this transition
> > work since we need to wait at some points, rather than delegating it
> > to the checkpointer process.
>
> The background worker being added is "wal_level control worker". I wonder if
> it would make sense to create a more "generic" one instead (to whom we could
> assign more "tasks" later on, as suggested in the past in [1]).
>
> +   /*
> +    * XXX: Perhaps it's not okay that we failed to launch a bgworker and give
> +    * up wal_level change because we already reported that the change has
> +    * been accepted. Do we need to use aux process instead for that purpose?
> +    */
> +   if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> +       ereport(WARNING,
> +               (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
> +                errmsg("out of background worker slots"),
> +                errhint("You might need to increase \"%s\".", "max_worker_processes")));
>
> Not sure it has to be an aux process instead as it should be busy in rare occasions.

Thank you for referring to the custodian worker thread. I'm not sure
that online wal_level change work would fit the concept of custodian
worker, which offloads some work for time-critical works such as
checkpointing, but this idea made me think of other possible
directions of this work.

Looking at the latest custodian worker patch, the basic architecture
is to have a single custodian worker and processes can ask it for some
work such as removing logical decoding related files. The online
wal_level change will be the one of the tasks that processes (eps.
checkpointer) can ask for it. On the other hand, one point that I
think might not fit this wal_level work well is that while the
custodian worker is a long-lived worker process, it's sufficient for
the online wal_level change work to have a bgworker that does its work
and then exits. IOW, from the perspective of this work, I prefer the
idea of having one short-lived worker for one task over having one
long-lived worker for multiple tasks. Reading that thread, while we
need to resolve the XID wraparound issue for the work of removing
logical decoding related files, the work of removing temporary files
seems to fit a short-lived worker style. So I thought as one of the
directions, it might be worth considering to have an infrastructure
where we can launch a bgworker just for one task, and we implement the
online wal_level change and temporary files removal on top of it.

> Maybe we could add some mechanism for ensuring that a bgworker slot is available
> when needed (as suggested in [2])?

Yeah, we need this mechanism if we use a bgworker for these works.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Hi,

On Fri, Feb 14, 2025 at 12:17:48AM -0800, Masahiko Sawada wrote:
> On Tue, Feb 11, 2025 at 11:44 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:

> Looking at the latest custodian worker patch, the basic architecture
> is to have a single custodian worker and processes can ask it for some
> work such as removing logical decoding related files. The online
> wal_level change will be the one of the tasks that processes (eps.
> checkpointer) can ask for it. On the other hand, one point that I
> think might not fit this wal_level work well is that while the
> custodian worker is a long-lived worker process,

That was the case initialy but it looks like it would not have been the case
at the end. See, Tom's comment in [1]:

"
I wonder if a single long-lived custodian task is the right model at all.
At least for RemovePgTempFiles, it'd make more sense to write it as a
background worker that spawns, does its work, and then exits,
independently of anything else
"

> it's sufficient for
> the online wal_level change work to have a bgworker that does its work
> and then exits.

Fully agree and I did not think about changing this behavior.

> IOW, from the perspective of this work, I prefer the
> idea of having one short-lived worker for one task over having one
> long-lived worker for multiple tasks.

Yeah, or one short-lived worker for multiple tasks could work too. It just 
starts when it has something to do and then exit.

> Reading that thread, while we
> need to resolve the XID wraparound issue for the work of removing
> logical decoding related files, the work of removing temporary files
> seems to fit a short-lived worker style. So I thought as one of the
> directions, it might be worth considering to have an infrastructure
> where we can launch a bgworker just for one task, and we implement the
> online wal_level change and temporary files removal on top of it.

Yeap, that was exactly my point when I mentioned the custodian thread (taking
into account Tom's comment quoted above).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hi,

On Mon, Feb 17, 2025 at 12:07:56PM -0800, Masahiko Sawada wrote:
> On Fri, Feb 14, 2025 at 2:35 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > Yeap, that was exactly my point when I mentioned the custodian thread (taking
> > into account Tom's comment quoted above).
> >
> 
> I've written PoC patches to have the online wal_level change work use
> a more generic infrastructure. These patches are still in PoC state
> but seem like a good direction to me. Here is a brief explanation for
> each patch.

Thanks for the patches!

> * The 0001 patch introduces "reserved background worker slots". We
> allocate max_process_workers + BGWORKER_CLASS_RESERVED at startup, and
> if the number of running bgworker exceeds max_worker_processes, only
> workers using the reserved slots can be launched. We can request to
> use the reserved slots by adding BGWORKER_CLASS_RESERVED flag at
> bgworker registration.

I had a quick look at 0001 and I think the way that's implemented is reasonnable.
I thought this could be defined through a GUC so that extensions can benefit
from it. But OTOH the core code should ensure the value is > as the number of
reserved slots needed by the core so not using a GUC looks ok to me.

> * The 0002 patch introduces "bgtask worker". The bgtask infrastructure
> is designed to execute internal tasks in background in
> one-worker-per-one-task style. Internally, bgtask workers use the
> reserved bgworker so it's guaranteed that they can launch.

Yeah.

> The
> internal tasks that we can request are predefined and this patch has a
> dummy task as a placeholder. This patch implements only the minimal
> functionality for the online wal_level change work. I've not tested if
> this bgtask infrastructure can be used for tasks that we wanted to
> offload to the custodian worker.

Again, I had a quick look and looks simple enough of our need here. It "just"
executes "(void) InternalBgTasks[type].func()" and then exists. That's, I think,
a good starting point to add more tasks in the future (if we want to).

> * The 0003 patch makes wal_level a SIGHUP parameter. We do the online
> wal_level change work using the bgtask infrastructure. There are no
> major changes from the previous version other than that.

It replaces the dummy task introduced in 0002 by the one that suits our needs
here (through the new BgTaskWalLevelChange() function).

The design looks reasonable to me. Waiting to see if others disagree before
looking more closely at the code.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Wed, Feb 19, 2025 at 1:56 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,

Thank you for looking at the patches.

>
> On Mon, Feb 17, 2025 at 12:07:56PM -0800, Masahiko Sawada wrote:
> > On Fri, Feb 14, 2025 at 2:35 AM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > > Yeap, that was exactly my point when I mentioned the custodian thread (taking
> > > into account Tom's comment quoted above).
> > >
> >
> > I've written PoC patches to have the online wal_level change work use
> > a more generic infrastructure. These patches are still in PoC state
> > but seem like a good direction to me. Here is a brief explanation for
> > each patch.
>
> Thanks for the patches!
>
> > * The 0001 patch introduces "reserved background worker slots". We
> > allocate max_process_workers + BGWORKER_CLASS_RESERVED at startup, and
> > if the number of running bgworker exceeds max_worker_processes, only
> > workers using the reserved slots can be launched. We can request to
> > use the reserved slots by adding BGWORKER_CLASS_RESERVED flag at
> > bgworker registration.
>
> I had a quick look at 0001 and I think the way that's implemented is reasonnable.
> I thought this could be defined through a GUC so that extensions can benefit
> from it. But OTOH the core code should ensure the value is > as the number of
> reserved slots needed by the core so not using a GUC looks ok to me.

Interesting idea. I kept the reserved slots only for internal use but
it would be worth considering to use GUC instead.

> > * The 0002 patch introduces "bgtask worker". The bgtask infrastructure
> > is designed to execute internal tasks in background in
> > one-worker-per-one-task style. Internally, bgtask workers use the
> > reserved bgworker so it's guaranteed that they can launch.
>
> Yeah.
>
> > The
> > internal tasks that we can request are predefined and this patch has a
> > dummy task as a placeholder. This patch implements only the minimal
> > functionality for the online wal_level change work. I've not tested if
> > this bgtask infrastructure can be used for tasks that we wanted to
> > offload to the custodian worker.
>
> Again, I had a quick look and looks simple enough of our need here. It "just"
> executes "(void) InternalBgTasks[type].func()" and then exists. That's, I think,
> a good starting point to add more tasks in the future (if we want to).

Yeah, we might want to extend it further, for example to pass an
argument to the background task or to ask multiple tasks for the
single bgtask worker. As far as I can read the custodian patch set,
the work of removing temp files seems not to require any argument
though.

>
> > * The 0003 patch makes wal_level a SIGHUP parameter. We do the online
> > wal_level change work using the bgtask infrastructure. There are no
> > major changes from the previous version other than that.
>
> It replaces the dummy task introduced in 0002 by the one that suits our needs
> here (through the new BgTaskWalLevelChange() function).
>
> The design looks reasonable to me. Waiting to see if others disagree before
> looking more closely at the code.

Thanks.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com