After upgrade from Pg11.2 to 17.7 logical replication prevents database instance shutdown - Mailing list pgsql-general
| From | Aleš Zelený |
|---|---|
| Subject | After upgrade from Pg11.2 to 17.7 logical replication prevents database instance shutdown |
| Date | |
| Msg-id | CAODqTUZXgywhJXGK1UmaWtJDVuzXXUYG4-DTCuV0VkB--+SCWA@mail.gmail.com |
| List | pgsql-general |
Hello,
We have recently upgraded from PostgreSQL 11.2 to PostgreSQL 17.7. We have logical replication between two database instances; no third-party CDC consumers are used.
During low traffic on the publisher database, there are no issues, and the publisher instance shutdown is smooth, as expected.
If we request a shutdown while replication from the publisher to the subscriber instance is lagging, the shutdown hangs for exactly 30 minutes. The shutdown is requested with systemctl stop ...., which the systemd unit defines as
ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast %i stop
The 30 minutes are counted from the "received fast shutdown request" message in the database log, and the hang ends with the log message (... 0 5029/2736 sub_xxx_usd START_REPLICATION [57P01]:FATAL: terminating connection due to administrator command).
We checked the corresponding logs from PG 11.2: there, the same shutdown took exactly 60 seconds.
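To make the "replication lag" condition above measurable before a shutdown is requested, a minimal sketch of a lag check on the publisher could look like the following. The connection string `PUB_DSN` and the function names are hypothetical placeholders, not something from our setup; the query relies only on the standard `pg_stat_replication` view.

```shell
#!/usr/bin/env bash
# Sketch: report how far each walsender is behind the current WAL position.
# PUB_DSN is a hypothetical connection string for the publisher instance.
PUB_DSN="${PUB_DSN:-host=publisher.example dbname=postgres}"

# Print the lag query; pg_stat_replication has one row per active walsender.
walsender_lag_sql() {
  cat <<'SQL'
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))  AS not_yet_sent,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS not_yet_flushed
FROM pg_stat_replication
ORDER BY application_name;
SQL
}

# Run the query on the publisher (requires psql and a reachable server).
show_walsender_lag() {
  walsender_lag_sql | psql "$PUB_DSN" -X -v ON_ERROR_STOP=1 -f -
}
```

Calling `show_walsender_lag` shortly before `systemctl stop` shows whether the shutdown is about to be requested with outstanding WAL for the subscription's walsender.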
We have also tried setting checkpoint_timeout = 27min and archive_timeout = 23min to make sure the delayed shutdown is not related to these parameters, but the shutdown is still blocked for exactly 30 minutes.
If we disable the subscription, the shutdown is smooth. That is why we suspect either a change in logical replication or a new configuration parameter that we have missed, one that would let the publisher instance shut down cleanly without that long delay, which currently ends only when the sender process on the publisher instance is finally terminated.
PostgreSQL version:
PostgreSQL 17.7 (Ubuntu 17.7-3.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0, 64-bit
Timeouts:
publisher instance:
powa=# show wal_sender_timeout;
wal_sender_timeout
--------------------
10min
(1 row)
subscriber instance:
powa=# show wal_receiver_timeout;
wal_receiver_timeout
----------------------
10min
(1 row)
OS version:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.5 LTS
Release: 22.04
Codename: jammy
We have found https://github.com/postgres/postgres/commit/5231ed8262c94936a69bce41f64076630bbd99a2, but we are not sure whether it applies to the behavior change described above.
Also, the "walsender.c" comment seems to explain that the shutdown is intentionally postponed. That could take a very long time: in our case the lag is caused by ETLs and can be about 80GB, so postponing the shutdown until all of the lag has been processed costs a lot of time. The comment does not, however, explain the timeout change from 60 seconds to 30 minutes (no timeout is mentioned):
* If the server is shut down, checkpointer sends us
* PROCSIG_WALSND_INIT_STOPPING after all regular backends have exited. If
* the backend is idle or runs an SQL query this causes the backend to
* shutdown, if logical replication is in progress all existing WAL records
* are processed followed by a shutdown. Otherwise, this causes the walsender
* to switch to the "stopping" state. In this state, the walsender will reject
* any further replication commands. The checkpointer begins the shutdown
* checkpoint once all walsenders are confirmed as stopping. When the shutdown
* checkpoint finishes, the postmaster sends us SIGUSR2. This instructs
* walsender to send any outstanding WAL, including the shutdown checkpoint
* record, wait for it to be replicated to the standby, and then exit.
Our pipeline requires an instance restart. So far, the only workaround we have found is to explicitly disable the subscription before initiating the shutdown, but we consider this a bit fragile compared to the smooth behavior on Pg11.
Is there a way to shorten the 30-minute shutdown so that it comes closer to the Pg11 behavior?
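For reference, the workaround can be sketched roughly as below. This is an illustration only, not our exact pipeline: `SUB_DSN`, `SUB_NAME` and `PUB_UNIT` are hypothetical placeholders (the unit name assumes the Debian/Ubuntu `postgresql@<version>-<cluster>` naming), and by default the script only prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: disable the subscription around a publisher restart.
# SUB_DSN, SUB_NAME and PUB_UNIT are hypothetical placeholders.
SUB_DSN="${SUB_DSN:-host=subscriber.example dbname=postgres}"
SUB_NAME="${SUB_NAME:-sub_example}"
PUB_UNIT="${PUB_UNIT:-postgresql@17-main}"

# With DRY_RUN=1 (the default) a command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

restart_publisher() {
  local rc
  # 1. Stop the apply worker on the subscriber so the walsender goes away.
  run psql "$SUB_DSN" -X -v ON_ERROR_STOP=1 \
      -c "ALTER SUBSCRIPTION $SUB_NAME DISABLE" || return 1
  # 2. Restart the publisher; remember the result for the caller.
  run systemctl restart "$PUB_UNIT"
  rc=$?
  # 3. Re-enable the subscription even if the restart failed.
  run psql "$SUB_DSN" -X -v ON_ERROR_STOP=1 \
      -c "ALTER SUBSCRIPTION $SUB_NAME ENABLE"
  return "$rc"
}
```

Running `DRY_RUN=1 restart_publisher` prints the three commands in order; `DRY_RUN=0 restart_publisher` executes them. The fragile part is visible here: if the script dies between steps 1 and 3, the subscription stays disabled until someone enables it again.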
Thanks in advance
Ales Zeleny